(54) Title: METHOD AND SYSTEM FOR CREDIT-BASED DATA FLOW CONTROL



(57) Abstract 

Methods and systems for controlling data flow between a sender and a receiver include
communicating credit lists to the sender. The credit lists include credits indicative of
receive buffer sizes accessible by the receiver and capable of receiving data. The sender
transmits data packets to the receiver. The data packets are preferably no greater in size
than the credits specified in the credit list. When the sender uses all of the credits, the
sender preferably refrains from sending data packets to the receiver until the supply of
credits is replenished by the receiver. Because data flow between the sender and the
receiver is regulated using credits, the likelihood of data overflow errors is reduced and
communication efficiency is increased.
METHOD AND SYSTEM FOR CREDIT-BASED DATA FLOW CONTROL

This application claims the benefit of U.S. Provisional Patent Application No.
60/095^97, filed August 4, 1998, the disclosure of which is incorporated herein by
reference in its entirety.

TECHNICAL FIELD

The present invention relates to methods and systems for controlling data flow
between sending and receiving processes executing on one or more computers. More
particularly, the present invention relates to methods and systems for controlling data
flow between a sender and a receiver, each including one or more computer processes, by
communicating credits from the receiver to the sender indicating receive buffer sizes, with
reduced copying of data between sending and receiving applications.

BACKGROUND OF THE INVENTION

In computer communication systems, it is desirable to control the flow of data
from a sending process to a receiving process. For example, if a sending process sends
data to a receiving process faster than the receiving process can receive and process the
data, data may be lost or overwritten. Similarly, if a sending process sends data and the
receiving process fails to provide a buffer to receive the data, the connection between the
sending and receiving processes may be broken.

In conventional flow control techniques, such as TCP flow control techniques,
data flow is regulated between TCP buffers at the transport level. More particularly, TCP
protocol software may utilize a sliding window to control flow between a sender's TCP
buffer and a receiver's TCP buffer. According to TCP flow control, the sender maintains
one window to monitor data segments that have been sent to the receiver and
acknowledged, data segments that have been sent and not acknowledged, and data
segments that have not been sent. The receiver maintains a similar window to reassemble
the data in the receiver's TCP buffer. When a receiving application reads data from the
receiver's TCP buffer, the data is copied from the receiver's TCP buffer to an
application-level receive buffer and new data can be received in the TCP buffer. Thus, in
order to regulate flow between a TCP sender and a TCP receiver, it is only necessary that
the receiver communicate the size of the TCP buffer to the sender, rather than the size of
the application-level buffers.

The communication of the TCP buffer size to a TCP sender is accomplished
through acknowledgement packets sent from the receiver to the sender. Each
acknowledgement packet acknowledges a specific data segment sent from the sender to
the receiver. Each acknowledgement includes a size field advertising the size of the
receiver's TCP buffer to the sender. The sender adjusts its window according to the
advertised size and sends no more data than the current window size permits. Thus, once
the sender fills the current window and sends the data to the receiver, the sender waits for
acknowledgement packets from the receiver indicating that the receiver's TCP buffer has
been emptied and more data can be sent. This waiting may be undesirable, since the
acknowledgement packets may be delayed due to network congestion.

Another problem with conventional TCP flow control methods is that the TCP
buffer size information communicated by a TCP receiver may not reflect the actual
available TCP buffer size. For example, conventional TCP protocol software may
advertise to the sender an upper limit on the number of bytes that a TCP buffer is capable
of receiving. This upper limit may not reflect the actual memory space reserved for the
TCP buffer when data arrives from the sender. Thus, conventional flow control methods
may not communicate accurate buffer size information to the sender.

Yet another problem with TCP flow control methods is that the copying of data
between the TCP buffers and the sending and receiving application buffers introduces
latency into data transfers. As a result of this latency, these methods may not be feasible
in high-speed environments, such as system area networks (SANs). For example, in
TCP, data may be copied from a sender's application-level buffer to the sender's TCP
buffer and from a receiver's TCP buffer to the receiver's application-level buffer. This
copying may have a significant impact on I/O performance in high-speed environments.

In order to increase I/O performance over conventional communications
protocols, some communication protocols, such as the Virtual Interface Architecture
(VIA), do not buffer data for an application or perform fragmentation and reassembly of
data. Data is sent from a sending I/O device, over a network, and received directly into
an application-level receive buffer of a receiver. If a sender utilizing the VIA architecture
attempts to send data when a receive buffer is not available, the connection between the
sender and receiver is broken. The breaking of a connection is a catastrophic,
unrecoverable error that requires reestablishment of the connection and resending of the
data. Similarly, when a sender utilizing VIA sends more data than a receive buffer can
hold, or a larger buffer than the maximum transfer unit (MTU) of the network, the
connection may also be broken. When a sender sends an amount of data smaller than the
size of a receive buffer, communication is not broken. However, sending less data than the
receiver is capable of receiving may be inefficient. TCP flow control methods may be
unsuitable for solving these problems because of the latency introduced by copying,
fragmentation, and reassembly, and because TCP flow control methods are based on TCP
buffer size, rather than application buffer size. Thus, there exists a need for methods and
systems for controlling flow between a sender and a receiver that alleviate the difficulties
with conventional flow control techniques.

SUMMARY OF THE INVENTION

The present invention includes methods and systems for controlling flow of data
over a connection, preferably a reliable connection, between a sender and a receiver,
while reducing the need for copying of data. As used herein, the term "sender" is
intended to refer to one or more processes that communicate with a receiver, which also
includes one or more processes. The sender and the receiver may execute on the same
computer or on separate computers. The terms "sender" and "receiver" are not intended
to include or be limited to any specific hardware configuration or to processes capable of
only sending or only receiving data. For example, both a sender and a receiver may be
capable of sending and receiving data.

According to one aspect, the invention includes a method for controlling flow of
data from a send buffer associated with a sender to a receive buffer associated with a
receiver. In a preferred implementation of the invention, the only copy of the data made
between the send buffer and the receive buffer may be the signal transmitted over the
communication link between the sender and the receiver. Copying of data increases the
time required to process an I/O request. Thus, reducing the number of copies between the
send buffer and the receive buffer increases transmission efficiency.

In order to control the flow of data without copying the data, the receiver may
communicate application-level receive buffer sizes to the sender. The receiver preferably
communicates the buffer size information to the sender in an efficient manner. For
example, the more buffer size information communicated to the sender in each flow
control communication, the more efficient the communication process. In one
implementation, the receiver may communicate a list containing at least one
application-level receive buffer size to the sender, so that the sender can determine how
much data the receiver is capable of receiving. In preferred implementations of the
invention, the receiver may send a list containing a plurality of application-level receive
buffer sizes to the sender. One method for communicating the list of buffer sizes to the
sender is by sending a message, e.g., a packet, from the receiver to the sender over a data
channel established between the sender and the receiver. The message may contain the
list of receive buffer sizes, and is hereinafter referred to as a credit message. The receive
buffer sizes in the credit message are hereinafter referred to as credits.

The sender may utilize the credits in the credit message to determine the size and
order of data packets to be sent to the receiver. For example, the sender preferably does
not exceed the size indicated by a particular credit or send data when no credits are
available. In addition, the sender preferably uses the credits in the order that the credits
are received from the receiver, so that the receiver can receive data into the correct
buffers. Because the credits are preferably indicative of application-level receive buffer
sizes, the data sent by the sender can be received directly into allocated application-level
receive buffers. Thus, the credit-based flow control methods and systems according to
the invention provide both reliable and efficient data transfer between senders and
receivers.
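
The sender-side rules described above (use credits in the order received, never exceed
the size of a credit, and stop sending when no credits remain) can be sketched in Python.
This is only an illustration under stated assumptions: the function name
partition_by_credits, the use of a deque to hold the credit list, and the choice to hand
the unsent remainder back to the caller are hypothetical and do not come from the
specification.

```python
from collections import deque


def partition_by_credits(data: bytes, credits: deque):
    """Split data into packets, consuming credits in the order received.

    Each packet is no larger than the credit it consumes. Partitioning
    stops when the credits run out; the unsent remainder is returned so
    the caller can wait for the receiver to replenish the credits.
    Credits are assumed to be positive byte counts.
    """
    packets = []
    offset = 0
    while offset < len(data) and credits:
        size = credits.popleft()               # oldest credit first
        packets.append(data[offset:offset + size])
        offset += size
    return packets, data[offset:]


# Example: three credits of 4, 2 and 8 bytes for a 10-byte payload.
credits = deque([4, 2, 8])
packets, unsent = partition_by_credits(b"abcdefghij", credits)
```

In the example, the third packet is shorter than its 8-byte credit, which is permitted
because a packet only needs to be no greater in size than the credit it uses.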

Another method for communicating credits to the sender is using shared memory.
For example, the sender and the receiver may each comprise a process or processes
executing on the same machine or on different machines that utilize shared memory to
communicate with each other. The shared memory may include a control portion and a
data portion. In order to control flow, the receiver may write credits to the control portion
indicative of receive buffer sizes in the data portion available for receiving data. The
sender may read the credits in the control portion to determine how to partition data being
sent.

Still another method for communicating credits to the sender is a remote direct
memory access (RDMA) write operation. In RDMA write operations, the receiver may
send a list of credits directly to the memory of a remote machine on which the sender
executes. The sender may poll the memory location or locations of the buffer that
receives RDMA transfers to determine when credits are available. Alternatively, the
sender may be notified asynchronously of the arrival of credits in the RDMA buffer. The
sender may use the credits in the manner previously described to determine how to
partition and send data to the receiver.

In implementations of the invention where credit messages are used to deliver
credits to the sender, the credit messages may be delivered using a new protocol or by
extending an existing protocol. For example, in a new protocol, the sender and the
receiver may exchange credit messages over a control channel established exclusively for
the exchange of credit messages. In order to extend an existing protocol, credits may be
communicated to the sender using optional data fields in the existing protocol. For
example, in TCP, credits may be communicated to the sender using the OPTIONS field
in any TCP packet, such as a TCP acknowledgment packet. The TCP sender may then
send data to the receiver having lengths corresponding to the credits.
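
As an illustration of a credit message carried over a dedicated control channel, the
following Python sketch packs a list of receive buffer sizes into a byte string and
unpacks it again. The wire layout (a 2-byte count followed by one 4-byte unsigned size
per credit, in network byte order) and the function names are assumptions chosen for the
example; the specification does not define a message format.

```python
import struct


def encode_credit_message(credits):
    """Pack a credit list as a 2-byte count followed by 4-byte sizes.

    The layout is hypothetical. "!" selects network byte order with no
    padding, so the message is exactly 2 + 4 * len(credits) bytes long.
    """
    return struct.pack(f"!H{len(credits)}I", len(credits), *credits)


def decode_credit_message(message):
    """Recover the ordered list of credits from an encoded message."""
    (count,) = struct.unpack_from("!H", message, 0)
    return list(struct.unpack_from(f"!{count}I", message, 2))


# Example: three application-level receive buffers of different sizes.
message = encode_credit_message([4096, 1024, 65536])
credits = decode_credit_message(message)
```

The order of the sizes in the message is preserved on decoding, which matters because
the sender is expected to use credits in the order in which they were communicated.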

According to another aspect, the present invention may include methods and
systems for determining when to communicate credits to a sender. The receiver
preferably communicates credits to the sender in a timely manner. For example, if the
sender has data to be sent and the receiver fails to timely notify the sender of the available
receive buffer space, sending may be delayed. In order to avoid delays in sending, the
receiver may monitor credits sent to the sender, the rate at which the sender uses the
credits, and/or when the sender uses particular credits in a credit list previously
communicated to the sender. Based on the monitored information, the receiver may
determine when to communicate new credits to the sender to avoid the condition where
the sender has data to send but has no credits. For example, the receiver may
communicate new credits to the sender after receiving data from the sender into a first
receive buffer specified in a credit list previously communicated to the sender. In another
alternative, the receiver may communicate a new credit list to the sender when the receive
buffer corresponding to a buffer size near the end of the previous credit list receives data
from the sender. In yet another alternative, new credits may be communicated to the
sender when a receive buffer between the first and last buffers in the previous credit list
receives data from the sender.
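
The three alternatives above differ only in which receive buffer of the previous credit
list triggers the next credit message. A minimal Python sketch follows. The policy names
and the specific positions chosen for "near the end" (second-to-last) and "between the
first and last" (midpoint) are assumptions for illustration; the specification leaves the
exact positions open.

```python
def trigger_index(list_length: int, policy: str) -> int:
    """Return the index of the buffer whose fill triggers new credits."""
    if list_length < 1:
        raise ValueError("credit list must contain at least one credit")
    if policy == "first":
        return 0
    if policy == "middle":                     # assumed: midpoint of the list
        return list_length // 2
    if policy == "near_end":                   # assumed: second-to-last buffer
        return max(0, list_length - 2)
    raise ValueError(f"unknown policy: {policy}")


def should_send_new_credits(filled_index: int, list_length: int, policy: str) -> bool:
    """True when data has just arrived in the triggering buffer."""
    return filled_index == trigger_index(list_length, policy)
```

Triggering on the first buffer replenishes credits earliest, while triggering near the
end sends fewer credit messages at the risk of the sender running out of credits.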

Since credits may be received in a finite-sized buffer managed by the sender, the
flow of credits from the receiver to the sender is preferably controlled. In order to control
credit flow, the receiver may utilize the receipt of data from the sender as an indication
that there is a buffer available to receive new credits. For example, the sender preferably
only sends data to the receiver when the sender has been notified through a credit list that
a receive buffer is available. Thus, when the receiver receives data from the sender, the
receiver knows that a previous credit list has been successfully communicated to the
sender. When the sender receives new credits from the receiver, the sender preferably
posts a new receive buffer to receive additional credits. The sender is preferably
prevented from using credits in the new credit list until the buffer for receiving the next
credit list is posted. Thus, when the receiver receives data from the sender corresponding
to the first credit in a new credit list, the receiver also knows that a buffer for receiving
additional credit lists is available. One additional assumption made by the receiver is that
the sender initially, i.e., before any credit messages or data is transferred, has at least one
buffer available for receiving credit lists. Finally, the size of the credit list is preferably
no greater than the size of the sender's credit list buffer or the network MTU between the
sender and receiver, whichever is smaller. Thus, based on these rules, the present
invention reliably implements flow control of credits.
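
The sizing rule at the end of the paragraph above (a credit list no larger than the
smaller of the sender's credit list buffer and the network MTU) can be sketched as
follows. The per-message header size, the per-credit size, and the function names are
hypothetical values chosen for the example, not figures from the specification.

```python
def max_credits_per_message(credit_buffer_bytes: int, mtu_bytes: int,
                            header_bytes: int = 2, credit_bytes: int = 4) -> int:
    """Number of credits that fit in one credit message.

    The message may not exceed the sender's credit list buffer or the
    network MTU, whichever is smaller. Header and credit sizes are
    assumed values for illustration.
    """
    limit = min(credit_buffer_bytes, mtu_bytes)
    return max(0, (limit - header_bytes) // credit_bytes)


def split_credit_list(credits, per_message: int):
    """Break a long credit list into message-sized lists, keeping order."""
    if per_message < 1:
        raise ValueError("each message must carry at least one credit")
    return [credits[i:i + per_message] for i in range(0, len(credits), per_message)]
```

Under the rules above, a receiver that splits a long list this way would still send the
second message only after data arrives for the first credit of the first message, since
that arrival is its indication that the sender has posted another credit list buffer.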

According to another aspect, the present invention includes a method for
controlling data flow between a sender and a receiver. The method includes
communicating a first credit list to a sender. The first credit list may include a plurality
of credits indicative of buffer sizes of receive buffers accessible by the receiver and
capable of receiving data from the sender. In response to receiving the first credit list, the
sender transmits a data packet to the receiver. The data packet is no greater in size than a
first buffer size specified by a first credit in the first credit list.

According to another aspect, the present invention includes a credit list
builder/communicator including computer-executable instructions embodied in a
computer-readable medium for performing steps. The steps may include receiving
requests for receiving data into a plurality of receive buffers accessible by a receiver and
capable of receiving data from a sender. In response to the requests, the credit list
builder/communicator may build a credit list including a plurality of credits indicative of
sizes of a plurality of receive buffers. After building a credit list, the credit list
builder/communicator may communicate the credit list to the sender.

According to another aspect, the present invention may include a data structure
for controlling data flow between a sender and a receiver. The data structure may include
a credit list including a plurality of credits. Each credit in the credit list is indicative of a
buffer size of a receive buffer accessible by a receiver and capable of receiving data from
a sender.

According to another aspect, the present invention may include a credit list
reader/processor including computer-executable instructions embodied in a
computer-readable medium for performing steps. The steps may include posting a first
buffer for receiving credits from a receiver. The credit list reader/processor may determine
whether credits have been received in the first buffer, and, in response to receiving credits
in the first buffer, the credit list reader/processor may post a second buffer for receiving
additional credits. After posting the second buffer, the credit list reader/processor may
store credits from the first buffer in a credit list.
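
The ordering of these steps (post the next credit buffer first, and only then make the
newly received credits usable) can be sketched in Python. The class name, the
post_buffer callback, and the method names are hypothetical; the callback stands in for
whatever operation posts a receive buffer on the underlying transport.

```python
from collections import deque


class CreditListReader:
    """Sender-side sketch of the credit list reader/processor steps."""

    def __init__(self, post_buffer):
        # post_buffer is a hypothetical callback that posts one buffer
        # for receiving a credit list from the receiver.
        self._post_buffer = post_buffer
        self.credits = deque()                 # usable credits, in order
        self._post_buffer()                    # post the first credit buffer

    def on_credits_received(self, received):
        """Handle a credit list that has arrived in the posted buffer."""
        self._post_buffer()                    # post the next buffer first
        self.credits.extend(received)          # then store the new credits


# Example: count how many credit buffers have been posted.
posted = []
reader = CreditListReader(lambda: posted.append("posted"))
reader.on_credits_received([4096, 1024])
```

Because the credits become usable only after the next buffer is posted, data sent
against the first new credit tells the receiver that another credit list can be accepted.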

According to another aspect, the present invention may include a credit list
builder/communicator including computer-executable instructions embodied in a
computer-readable medium for performing steps for determining when to communicate
additional credit messages to a sender. The steps may include communicating a first
credit list to a sender. The credit list builder/communicator may then determine if data
has been received in a first buffer corresponding to a first credit in the first credit list. In
response to determining that data has been received in the first buffer, the credit list
builder/communicator may communicate a second credit list to the sender.

According to another aspect, the present invention may include a credit list
builder/communicator including computer-executable instructions for performing steps
for determining when to communicate new credits to a sender. The steps may include
communicating a first credit list to a sender. After communicating the first credit list to
the sender, the credit list builder/communicator may monitor the frequency at which the
sender consumes credits in the first credit list. The credit list builder/communicator may
determine when to communicate a second credit list to the sender based on the
frequency. For example, the credit list builder/communicator may determine a triggering
buffer corresponding to a credit in the first credit list based on the frequency. The credit
list builder/communicator may instruct an input/output device to send the second credit
message to the sender when the triggering buffer receives data. In an alternative
arrangement, rather than determining a triggering buffer, the credit list
builder/communicator may determine a time in time units, such as milliseconds, for
determining when to send a new credit message to the sender, based on the frequency.
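
One way to derive a triggering buffer from the monitored frequency is sketched below in
Python. The heuristic is an assumption made for illustration and is not stated in the
specification: it picks the latest buffer in the list that still leaves enough unused
credits to cover the time a new credit message needs to reach the sender. The function
name and the delivery_delay_seconds parameter are hypothetical.

```python
import math


def triggering_buffer_index(list_length: int, credits_per_second: float,
                            delivery_delay_seconds: float) -> int:
    """Pick the buffer whose fill should trigger the next credit message.

    When the buffer at index i receives data, list_length - 1 - i credits
    remain unused. The index is chosen so that the remaining credits last
    at least as long as the assumed delivery delay of the new credit list.
    """
    needed = math.ceil(credits_per_second * delivery_delay_seconds)
    return max(0, list_length - 1 - needed)


# Example: 16 credits, consumed at 16 credits/s, 0.25 s to deliver new credits.
index = triggering_buffer_index(16, 16.0, 0.25)
```

A faster sender moves the triggering buffer toward the start of the list, so new
credits are sent earlier; a slow sender moves it toward the end.
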

According to another aspect of the invention, the receiver may utilize credits to
implement quality of service features. For example, the receiver may be a server that
provides services to a plurality of client senders. Since the server may concurrently
receive data from multiple clients, it may be desirable for the server to impose a
maximum allowable bandwidth restriction on each client, to prevent the server from
being overrun with data. One way that the server may control the bandwidth is by
regulating the number of unused credits available to each client so that no client has
enough credits to exceed the maximum allowable bandwidth. By using available credits
to regulate maximum bandwidth for each client, the server maintains a given quality of
service for all clients.
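
A minimal Python sketch of this idea follows. It assumes, purely for illustration, that
the server replenishes credits once per fixed interval and caps the total bytes of unused
credits a client may hold during that interval; the function name, the parameters, and
the per-interval accounting are hypothetical and are not taken from the specification.

```python
def grant_credits(buffer_sizes, outstanding_bytes: int, max_bytes_per_interval: int):
    """Choose which receive buffers to advertise to one client.

    buffer_sizes lists the buffers the server could advertise, in order.
    outstanding_bytes is the total size of credits the client already
    holds but has not used. Granting stops at the first buffer that would
    push the client over its per-interval byte allowance, so the order of
    the remaining buffers is preserved.
    """
    budget = max_bytes_per_interval - outstanding_bytes
    granted = []
    for size in buffer_sizes:
        if size > budget:
            break
        granted.append(size)
        budget -= size
    return granted
```

With a 10,000-byte allowance and three 4,096-byte buffers available, only two credits
would be granted; the third is held back until the client has used some of its credits.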

According to another aspect, the present invention may include a credit list
builder/communicator including computer-executable instructions embodied in a
computer-readable medium for performing steps. The steps may include operating in a
first mode for determining when to communicate new credits to a sender. The credit list
builder/communicator may receive in-band information from the sender and analyze the
in-band information. If the in-band information indicates that switching would increase
I/O performance, the credit list builder/communicator may switch to a second mode for
determining when to communicate new credits to the sender.

According to another aspect, the present invention may include an input/output
device. The input/output device may include a processing circuit and a memory device
coupled to the processing circuit. For example, the processing circuit may comprise a
microprocessor and the memory device may comprise on-chip memory of the
microprocessor. Alternatively, the memory device may comprise a memory chip external
to the chip containing the processing circuit. The memory device may comprise a
general-purpose memory, such as a read-only memory that stores computer-executable
instructions. Alternatively, the memory device may comprise an application specific
integrated circuit that implements the computer-executable instructions in hardware. The
computer-executable instructions included in or implemented by the memory device may
perform steps. The steps may include receiving requests for receiving data into receive
buffers stored in virtual memory locations of a host computer connectable to the
input/output device. The next step may include building a credit list including a plurality
of credits indicative of sizes of the receive buffers. Finally, after building the credit list,
the next step may include communicating the credit list to the sender.

According to another aspect, the present invention may include an input/output
device. The input/output device may include a processing circuit and a memory device,
as previously described. The computer-executable instructions included in or
implemented by the memory device may perform steps. The steps may include posting a
first buffer accessible by a sender for receiving credits from a receiver. The next step
may include determining whether credits have been received in the first buffer. In
response to receiving credits in the first buffer, the next step may include posting a
second buffer accessible by the sender for receiving additional credits from the receiver.
After posting the second buffer, the next step may include storing credits from the first
buffer in a credit list.

According to another aspect, the present invention may include a network
communications system. The network communication system may include a first local
virtual interface, a second local virtual interface, and a credit list builder/communicator.
The first local virtual interface may send data to and receive data from a first remote
virtual interface over a first network connection. The second local virtual interface may
send credit messages to and receive credit messages from a second remote virtual
interface over a second network connection. The credit list builder/communicator may
build credit messages for controlling data flow over the first network connection and
communicate the credit messages to the second remote virtual interface through the
second local virtual interface and the second network connection. The credit messages
may include credit lists including a plurality of credits indicative of buffer sizes of receive
buffers for receiving data through the first local virtual interface from the first remote
virtual interface. Alternatively, each virtual interface may be used to communicate data in
one direction while communicating credit messages in the reverse direction.

Additional features and advantages of the invention will be made apparent from
the following detailed description of illustrative embodiments which proceeds with
reference to the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

While the appended claims set forth the features of the present invention with
particularity, the invention, together with its objects and advantages, may be best
understood from the following detailed description taken in conjunction with the
accompanying drawings of which:

Figure 1 is a block diagram generally illustrating an exemplary computer system
on which embodiments of the present invention may reside;

Figure 2 is a block diagram illustrating a sender and a receiver including a system 
for controlling data flow according to an embodiment of the present invention; 

Figure 3 is a more detailed block diagram of the sender and the receiver including
the system for controlling data flow according to the embodiment of Figure 2;

Figure 3(a) is a detailed block diagram of the sender and the receiver according to
an alternative embodiment of the invention;

Figure 4 is a flow chart illustrating steps that may be performed by a credit list
builder/communicator of a receiver for determining when to communicate new credits to
a sender according to an embodiment of the present invention;

Figure 5 is a flow chart illustrating exemplary steps that may be performed by a
credit list builder/communicator of a receiver for determining when to communicate new
credits to a sender according to another embodiment of the present invention;

Figure 6 is a flow chart illustrating exemplary steps that may be performed by a 
credit list builder/communicator of a receiver for determining whether to switch from a 
first mode to a second mode for determining when to communicate new credits to a 
sender according to an embodiment of the present invention; 

Figures 7(a) and 7(b) are flow charts illustrating exemplary steps that may be
performed by a credit list reader/processor of a sender for reading and processing credits
according to an embodiment of the present invention;

Figure 8 is a flow diagram illustrating an example of the transfer of credits to and
the use of credits by a sender according to an embodiment of the present invention.

SPECIFIC DESCRIPTION OF THE INVENTION

Turning to the drawings, wherein like reference numerals refer to like elements,
the invention is illustrated as being implemented in a suitable computing environment.
Although not required, the invention will be described in the general context of
computer-executable instructions, such as program modules, being executed by a personal
computer. Generally, program modules include routines, programs, objects, components,
data structures, etc. that perform particular tasks or implement particular abstract data
types. Moreover, those skilled in the art will appreciate that the invention may be
practiced with other computer system configurations, including hand-held devices,
multi-processor systems, microprocessor based or programmable consumer electronics,
network PCs, minicomputers, mainframe computers, and the like. The invention may
also be practiced in distributed computing environments where tasks are performed by
remote processing devices that are linked through a communications network. In a
distributed computing environment, program modules may be located in both local and
remote memory storage devices.

With reference to Fig. 1, an exemplary system for implementing the invention
includes a general purpose computing device in the form of a conventional personal
computer 20, including a processing unit 21, a system memory 22, and a system bus 23
that couples various system components including the system memory to the processing
unit 21. The system bus 23 may be any of several types of bus structures including a
memory bus or memory controller, a peripheral bus, and a local bus using any of a variety
of bus architectures. The system memory includes read only memory (ROM) 24 and
random access memory (RAM) 25. A basic input/output system (BIOS) 26, containing
the basic routines that help to transfer information between elements within the personal
computer 20, such as during start-up, is stored in ROM 24. The personal computer 20
further includes a hard disk drive 27 for reading from and writing to a hard disk, not
shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk
29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31
such as a CD ROM or other optical media.

The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are
connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive
interface 33, and an optical disk drive interface 34, respectively. The drives and their
associated computer-readable media provide nonvolatile storage of computer readable



WO 00/41365 PCT/US99/30860



instructions, data structures, program modules and other data for the personal computer
20. Although the exemplary environment described herein employs a hard disk, a
removable magnetic disk 29, and a removable optical disk 31, it will be appreciated by
those skilled in the art that other types of computer readable media which can store data
that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital
video disks, Bernoulli cartridges, random access memories, read only memories, and the
like may also be used in the exemplary operating environment.

A number of program modules may be stored on the hard disk, magnetic disk 29,
optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more
applications programs 36, other program modules 37, and program data 38. The
operating system 35 may include a virtual memory manager and one or more I/O device
drivers that communicate with each other to maintain coherence between virtual memory
address mapping information stored by the operating system 35 and virtual memory
mapping information stored by one or more I/O devices, such as network interface
adapters 54 and 54a. A user may enter commands and information into the personal
computer 20 through input devices such as a keyboard 40 and a pointing device 42.
Other input devices (not shown) may include a microphone, touch panel, joystick, game
pad, satellite dish, scanner, or the like. These and other input devices are often connected
to the processing unit 21 through a serial port interface 46 that is coupled to the system
bus, but may be connected by other interfaces, such as a parallel port, game port or a
universal serial bus (USB). A monitor 47 or other type of display device is also
connected to the system bus 23 via an interface, such as a video adapter 48. In addition to
the monitor, personal computers typically include other peripheral output devices, not
shown, such as speakers and printers.

The personal computer 20 may operate in a networked environment using logical
connections to one or more remote computers, such as a remote computer 49. The
remote computer 49 may be another personal computer, a server, a router, a network PC,
a peer device or other common network node, and typically includes many or all of the
elements described above relative to the personal computer 20, although only a memory
storage device 50 has been illustrated in Fig. 1. The logical connections depicted in Fig.
1 include a local area network (LAN) 51, a wide area network (WAN) 52, and a system
area network (SAN) 53. Local- and wide-area networking environments are
commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
System area networking environments are used to interconnect nodes within a distributed
computing system, such as a cluster. For example, in the illustrated embodiment, the
personal computer 20 may comprise a first node in a cluster and the remote computer 49
may comprise a second node in the cluster. In such an environment, it is preferable that
the personal computer 20 and the remote computer 49 be under a common administrative
domain. Thus, although the computer 49 is labeled "remote", the computer 49 may be in
close physical proximity to the personal computer 20.

When used in a LAN or SAN networking environment, the personal computer 20
is connected to the local network 51 or system network 53 through the network interface
adapters 54 and 54a. The network interface adapters 54 and 54a may include processing
units 55 and 55a and one or more memory units 56 and 56a. The memory units 56 and
56a may contain computer-executable instructions for processing I/O requests including
translating virtual memory addresses to physical memory addresses, obtaining virtual
address mapping information from the operating system 35, and recovering from local
address translation failures. The memory units 56 and 56a may also contain page tables
used to perform local virtual to physical address translations.

When used in a WAN networking environment, the personal computer 20 
typically includes a modem 58 or other means for establishing communications over the 
WAN 52. The modem 58, which may be internal or external, is connected to the system 
bus 23 via the serial port interface 46. In a networked environment, program modules 
depicted relative to the personal computer 20, or portions thereof, may be stored in the
remote memory storage device. It will be appreciated that the network connections
shown are exemplary and other means of establishing a communications link between the
computers may be used.

When used in any of the networking environments illustrated in Figure 1, data
flow is preferably regulated between processes executing on the personal computer 20
and processes executing on the remote computer 49 that communicate with each other.
For example, the personal computer 20 may include a sender for sending data through
one of the network interface adapters 54 and 54a to a receiver executing on the remote
computer 49. Accordingly, in order to regulate data flow between the sender and the
receiver, the sender may include a credit list reader/processor for receiving and
processing credits from the receiver. The receiver may include a credit list
builder/communicator for building credit lists and communicating the credit lists to the
sender.

The present invention is not limited to regulating flow between processes 
executing on separate computers. The credit list builder/communicator and the credit list 
reader/processor may be used to regulate flow between a sender and a receiver executing 
on the same machine. For example, the sender and the receiver each may comprise an
application program executing on the personal computer 20 that utilizes a shared memory
region for communicating with each other. The shared memory region may include a
data portion and a control portion. In order to regulate flow, the credit list
builder/communicator of the receiver may write credits to the control portion of the shared
memory region. The credits may be indicative of receive buffer sizes in the data portion
of the shared memory region. In order to access the credits, the credit list
reader/processor of the sender may read the control portion of the shared memory region.
The credit list reader/processor preferably uses the credits in the order that the credits are
made available, preferably does not exceed the buffer size indicated by each credit, and
preferably only writes data to the data portion when credits are available. In this manner,
flow between the sender and the receiver may be regulated using credits in shared 
memory. 
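
By way of non-limiting illustration, the shared-memory exchange described above may be sketched in Python as follows. The names SharedRegion, post_credit and send are hypothetical, and pairing each credit size with an offset into the data portion is an assumption made for this sketch; the description above only requires that credits indicate receive buffer sizes.

```python
from collections import deque


class SharedRegion:
    """Hypothetical stand-in for a shared memory region."""

    def __init__(self, data_size=64):
        # Control portion: credits written by the receiver, oldest first.
        # Each credit is (offset, size); the offset field is an assumption.
        self.control = deque()
        # Data portion: holds the receive buffers that the credits describe.
        self.data = bytearray(data_size)


def post_credit(region, offset, size):
    """Receiver side: advertise one receive buffer in the control portion."""
    region.control.append((offset, size))


def send(region, payload):
    """Sender side: write payload using credits in order; return bytes written."""
    sent = 0
    while sent < len(payload) and region.control:
        offset, size = region.control.popleft()
        chunk = payload[sent:sent + size]  # never larger than the credit
        region.data[offset:offset + len(chunk)] = chunk
        sent += len(chunk)
    return sent  # less than len(payload) when the credits run out


region = SharedRegion()
post_credit(region, 0, 4)
post_credit(region, 8, 2)
written = send(region, b"abcdefgh")
```

With a four-byte credit and a two-byte credit available, only six of the eight bytes are written; the remaining two bytes wait until the receiver posts further credits.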

In yet another alternative, where the sender and receiver are executing on
different machines, RDMA write operations may be used to communicate credits from
the receiver to the sender. In RDMA write operations, the credit list
builder/communicator of the receiver may write credits directly to the memory of the
machine on which the credit message reader/processor of the sender executes. In order to
perform an RDMA write operation, the credit list builder/communicator may construct a
packet containing a list of credits and the destination memory address of the sender where
the credits will be stored. The sender may receive the packet directly into the specified
memory address. In order to use the credits, the credit list reader/processor may read the
memory location that receives the RDMA packet. The credit list reader/processor may
use the credits to send data to the receiver in the manner previously described. Thus,
RDMA write operations provide yet another mechanism for communicating credits to the
sender.
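
The RDMA-based exchange may likewise be sketched, purely for illustration, as a simulation in Python. No real RDMA interface is used here: the packet layout (an 8-byte destination address followed by a 2-byte credit count and 4-byte credit sizes) and the function names are assumptions invented for the sketch, and a bytearray stands in for the memory of the sender's machine.

```python
import struct


def build_credit_packet(dest_addr, credits):
    """Receiver side: pack the destination address and the list of credits."""
    payload = struct.pack(f"<H{len(credits)}I", len(credits), *credits)
    return struct.pack("<Q", dest_addr) + payload


def apply_rdma_write(memory, packet):
    """Simulated adapter: place the payload at the address named in the packet."""
    (dest_addr,) = struct.unpack_from("<Q", packet, 0)
    payload = packet[8:]
    if dest_addr + len(payload) > len(memory):
        raise ValueError("write would fall outside the target memory")
    memory[dest_addr:dest_addr + len(payload)] = payload


def read_credits(memory, addr):
    """Sender side: read the credit list from the location that was written."""
    (count,) = struct.unpack_from("<H", memory, addr)
    return list(struct.unpack_from(f"<{count}I", memory, addr + 2))


sender_memory = bytearray(32)  # stand-in for memory on the sender's machine
packet = build_credit_packet(16, [4, 2])
apply_rdma_write(sender_memory, packet)
credits = read_credits(sender_memory, 16)
```

In this sketch the sender takes no part in placing the credits; it simply reads them from the agreed location when it next needs to send.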

In the description that follows, the invention will be described with reference to
acts and symbolic representations of operations that are performed by one or more
computers, unless indicated otherwise. As such, it will be understood that such acts and
operations, which are at times referred to as being computer-executed, include the
manipulation by the processing unit of the computer and/or the processing units of I/O
devices of electrical signals representing data in a structured form. This manipulation
transforms the data or maintains it at locations in the memory system of the computer
and/or the memory systems of I/O devices, which reconfigures or otherwise alters the
operation of the computer and/or the I/O devices in a manner well understood by those
skilled in the art. The data structures where data is maintained are physical locations of
the memory that have particular properties defined by the format of the data. However,
while the invention is being described in the foregoing context, it is not meant to be
limiting as those of skill in the art will appreciate that the acts and operations described
hereinafter may also be implemented in hardware.

Figure 2 illustrates an exemplary sender 60 and a receiver 62 including a system
for controlling data flow according to an embodiment of the present invention. In the
illustrated embodiment, the sender 60 and the receiver 62 may each comprise a plurality
of processes executing on the same computer or on different computers that communicate
over a communication link 64. The communication link 64 may comprise a LAN, a
WAN, a SAN, or any other medium for transferring signals between connected devices.
If the sender and the receiver include application programs executing on the same
machine, the communication link may comprise a bus, such as a data bus. The sender 60
may include a sending application 66 for requesting the sending of data stored in one or
more send buffers 68 from an I/O device 70 to other applications. For example, the
sending application 66 may comprise a web server that sends data to other applications,
such as the receiving application 74 over the communication link 64. The I/O device 70
may comprise any type of device for sending and receiving data in response to requests
from an application. For example, the I/O device 70 may comprise a network interface
adapter, such as an Ethernet adapter. In order to reduce copying of data between the
sending application 66 and the I/O device 70, the I/O device 70 is preferably capable of
translating virtual memory addresses of data to be sent to physical memory addresses.
Exemplary mechanisms for translating virtual memory addresses to physical memory
addresses are described in copending U.S. Patent Application No. , filed
December 29, 1998, entitled, "Recoverable Methods for and Systems for Processing
Input/Output Requests," (Leydig, Voit & Mayer, Ltd. Attorney Docket No. 89079) the
disclosure of which is incorporated herein by reference in its entirety.

The sender 60 may also include an I/O device interface 72 for controlling
communication between the sending application 66 and the I/O device 70. For example,
the I/O device interface 72 may include communications functions, such as sockets, MPI,
and cluster functions that may be called by the sending application when requesting
sending of data. The I/O device interface 72 may convert the requests into data structures
recognizable by the I/O device 70. In order to reduce the copying of data between the
sending application 66 and the I/O device 70, the I/O device interface 72 may also
include memory registration functions for registering memory used by applications with
the I/O device 70. However, because the I/O device 70 is preferably capable of
recovering from local virtual address translation failures, memory registration may not be
required.

The receiver 62 may include the receiving application 74 for requesting receipt of
data from an I/O device 76 into one or more receive buffers 78. For example, the
receiving application 74 may comprise a web browser that receives data sent over a
network from other applications, such as the sending application 66. The I/O device 76
of the receiver may comprise any device capable of sending and receiving data over a
communication link in response to requests from the receiving application 74. The
receiver 62 preferably also includes an I/O device interface 80 for controlling
communication between the receiving application 74 and the I/O device 76. The I/O
device 76 and the I/O device interface 80 may be similar in structure to the I/O device 70
and the I/O device interface 72 of the sender and need not be further described.

According to an important aspect of the invention, the receiver 62 communicates
credits to the sender 60 to control the flow of data packets 84 sent by the sender 60. In
the illustrated embodiment, communicating the credits to the sender includes sending
credit messages 82 to the sender. The credit messages may be variously configured. In a
preferred embodiment, the credit messages may include credit lists containing buffer size
information relating to the size of one or more receive buffers 78 in which the receiving
application 74 may request receipt of data. In order to generate credit lists and
communicate credits to the sender, the receiver may include a credit list
builder/communicator 83. When the receiving application requests receipt of data into
one or more of the receive buffers 78, e.g., by communicating the buffer virtual addresses
and sizes to the I/O device interface 80, the credit list builder/communicator 83 may
generate credit messages including the sizes of the receive buffers 78 and forward the
credit messages to the I/O device 76 to be sent to the sender 60. Alternatively, the credit
list builder/communicator may communicate credits to the sender using a shared memory
buffer or through RDMA write operations, as previously described. The credit list
builder/communicator 83 may also determine when to communicate new credits to the
sender 60. Methods for determining when to communicate new credits to the sender 60
are discussed in more detail below.

In order to process the credits received from the receiver, the sender may include a
credit list reader/processor 75. The credit list reader/processor 75 may be variously
configured. For example, the credit list reader/processor 75 may receive credit messages
82 from the receiver 62 and extract credits including buffer size information from the
credit messages. Alternatively, when the receiver communicates credits to the sender
using a shared memory region or an RDMA write operation, the credit message
reader/processor may read data from the shared memory region or the buffer for receiving
RDMA writes.

According to an important aspect of the invention, the credit list reader/processor
75 preferably uses the credits to determine the size of data packets to be sent to the
receiver. In addition, the credit list reader/processor preferably uses the credits in the
order that the credits were received, so that the receiver will receive data in the correct
buffers. For example, the credit message 82 may indicate that the receiving application
74 has a first buffer of four bytes for receiving data and a second buffer of two bytes for
receiving data. The sending application 66 may have a send buffer of six bytes to be sent
to the receiving application. Under these conditions, the credit list reader/processor 75
may request that the I/O device 70 send a first data packet of four bytes and a second data
packet of two bytes to the receiver 62, e.g., by communicating the virtual addresses of the
data to be sent along with the appropriate sizes to the I/O device 70. The credit list
reader/processor 75 preferably maintains a list of credits received from the receiver 62
and removes credits from the list as the sender uses the credits. Thus, because the
receiver 62 preferably communicates credits indicative of application buffer sizes to the
sender, and the sender 60 constructs data packets having sizes based on the credits, data
flow between the sender and the receiver may be efficiently regulated. Moreover,
software copying, segmentation, and reassembly of data may not be required according to
preferred implementations of the invention because the data packets sent to the receiver
are preferably no greater in size than the corresponding receive buffers.

In order to communicate credit messages to the sender, the credit list
builder/communicator may utilize the same connection or a separate connection from the
connection used for receiving data from the sender. In a preferred embodiment, the
sender and the receiver send and receive data over one or more data connections and
exchange credit messages over a control connection separate from the data connections.
When the sender and the receiver communicate over multiple data connections, the credit
message builder/communicator may multiplex credits or credit messages transmitted over
the control connection. Each credit or credit message in the multiplexed control channel
may indicate the data connection to which it pertains. Thus, the methods for controlling
data flow according to the present invention are applicable to a single data connection or
to a plurality of data connections. The credit message reader/processor of the sender may
demultiplex the credit messages on the control channel and use the credits in the credit
messages to send data over the corresponding data connections. In order to prevent credit
message overflow on the control connection, the credit message reader/processor
preferably maintains a credit message buffer for each data connection.
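
The demultiplexing of credits received over a shared control connection may be sketched, for illustration only, as follows. The message shape (a connection identifier paired with a list of credit sizes) and the function name demultiplex are assumptions made for the sketch; the description above requires only that each credit or credit message indicate the data connection to which it pertains.

```python
from collections import defaultdict, deque


def demultiplex(control_messages):
    """Sort credits from one control channel into a buffer per data connection.

    Each message is (connection_id, [credit sizes]); arrival order is kept
    within each connection so credits are later used in the order received.
    """
    per_connection = defaultdict(deque)
    for connection_id, credits in control_messages:
        per_connection[connection_id].extend(credits)
    return per_connection


# Credit messages for two data connections interleaved on one control channel.
messages = [(1, [4, 2]), (2, [8]), (1, [16])]
buffers = demultiplex(messages)
```

Keeping one buffer per data connection means that a burst of credits for one connection cannot displace credits that belong to another.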

In yet another alternative embodiment, the receiver may communicate with a
plurality of senders. For example, the receiver may comprise a server and the senders
may comprise clients. In such an embodiment, the credit message builder/communicator
of the receiver may receive data from a plurality of client senders utilizing a separate data
connection for each sender. The credit message builder/communicator of the receiver
may communicate credits to each sender utilizing a separate control connection for each
sender.

The credit message builder/communicator and the credit message reader/processor
may be implemented in hardware, software, or a combination of hardware and software.
For example, in the embodiment illustrated in Figure 2, the credit message
builder/communicator and the credit message reader/processor may be components of the
communications provider software included in the I/O device interfaces 72 and 80. In an
alternative embodiment, the credit message builder/communicator and the credit message
reader/processor may be implemented in hardware of the I/O devices 70 and 76.
Implementing the credit message builder/communicator and the credit message
reader/processor in the hardware of the I/O devices allows flow control to be performed
transparently to the communications provider software.

Although the embodiment illustrated in Figure 2 shows a sender 60 and a receiver 
62 respectively having a credit list reader/processor 75 and a credit list 
builder/communicator 83, the present invention is not intended to be limited to such an 
embodiment. For example, the sender 60 and the receiver 62 may each be capable of 
sending and receiving data. Thus, the I/O device interface 72 of the sender 60 may
include a credit list builder/communicator 83 in addition to the credit list reader/processor
75. Similarly, the I/O device interface 80 of the receiver 62 may include a credit list
reader/processor 75 in addition to the credit list builder/communicator 83.

Figure 3 is a more detailed block diagram of the sender 60 and the receiver 62 
illustrated in Figure 2. The sender 60 and the receiver 62 illustrated in Figure 3
preferably implement the Virtual Interface Architecture (VIA). According to the VIA
architecture, the efficiency of I/O operations may be increased by granting I/O devices
direct access to application-level data buffers so that copying of data between
applications and the I/O devices is not required. In order to provide I/O devices direct
access to application-level buffers, the I/O device interfaces 72 and 80 communicate
descriptors to the I/O devices. A descriptor is a data structure containing I/O request
processing information, such as the virtual memory address and size of a send or receive
buffer. The I/O devices translate the virtual memory addresses in the descriptors to 
physical memory addresses and either send data from or receive data into a buffer at the 
physical memory address. The buffer size information in the descriptors may also be 
used by the credit list builder/communicator 83 to generate credit messages. 

In Figure 3, the I/O devices 70 and 76 preferably each comprise a VIA network
interface adapter capable of sending and receiving data and credit messages over the
communication link 64. A VIA network interface adapter may comprise any type of
network adapter capable of high-speed communications, for example, an Ethernet card,
such as a gigabit Ethernet card. In addition, the VIA network interface adapter is
preferably capable of translating virtual memory addresses of buffers used in I/O
operations into physical memory addresses.

The I/O device interfaces 72 and 80 of the sender 60 and the receiver 62 each
comprise a plurality of components for controlling communications between the
applications 66 and 74 and the I/O devices 70 and 76. For example, the I/O device
interface 72 of the sender 60 may include an operating system communication interface
88 and a virtual interface (VI) user agent 89. The I/O device interface 80 of the receiver
62 may also include an operating system communication interface 90 and a VI user agent
91. The operating system communication interfaces 88 and 90 and the VI user agents 89
and 91 of both the sender and the receiver may convert requests from the sending and
receiving applications into data structures, such as descriptors, for processing by the I/O
devices 70 and 76. Accordingly, the operating system communication interfaces 88 and
90 may include standard communications functions for performing network I/O, such as
sockets, MPI, cluster, or other communications functions. The VI user agents 89 and 91
may communicate memory registration requests through communication links 92 and 93
to VI kernel agents 94 and 95. The VI kernel agents 94 and 95 may be components of the
operating systems of the sender and the receiver that function as device drivers for the I/O
devices 70 and 76. The VI kernel agents 94 and 95 may receive the memory registration
requests from the VI user agents 89 and 91 and register memory used by the sending and
receiving applications 66 and 74 with the I/O devices 70 and 76. In addition, the VI
kernel agents 94 and 95 may establish and break connections with remote machines. The
VI kernel agents 94 and 95 may also manage one or more virtual interfaces, such as
virtual interface 96 of the sender 60 and virtual interface 97 of the receiver 62.

The virtual interfaces 96 and 97 may comprise communication interfaces between
the sending and receiving applications 66 and 74 and the I/O devices 70 and 76. The
virtual interface 96 of the sender 60 may include a send queue 98 and a receive queue 99.
Similarly, the virtual interface 97 of the receiver 62 may include a send queue 100 and a
receive queue 101. In order to request an I/O operation, the sending and receiving
applications 66 and 74 may execute standard I/O commands, such as Winsock send() and
Winsock recv(). In response to these commands, the VI user agents 89 and 91 may post
descriptors 102-109 to the send and receive queues of the sender and the receiver and
notify the I/O devices of the posting of the descriptors. Posting the descriptors may
include writing pointers to the virtual memory addresses of the descriptors to the virtual
memory addresses of the queues. Notifying the I/O devices of the descriptors may
include writing doorbell tokens including the virtual memory addresses of the descriptors
to virtual memory addresses of doorbells associated with each queue. The VI kernel
agents may map the virtual memory addresses of the doorbells to physical memory
addresses of doorbell control registers associated with the I/O devices. When the I/O
devices 70 and 76 receive doorbell tokens, the devices preferably increment a descriptor
counter for the associated queue. The I/O devices 70 and 76 decrement the counters
when descriptors are processed. The I/O devices 70 and 76 preferably process the
descriptors in the order that the descriptors are posted in the send and receive queues and
perform the requested I/O operations. The I/O devices preferably process the descriptors
until the queues are empty or until an unrecoverable error occurs.
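
The posting of descriptors and the doorbell counter described above may be modeled, as a simplified sketch only, in Python. The class names Descriptor and WorkQueue and their methods are hypothetical; address translation, doorbell tokens and error handling are omitted, and the counter is kept in the same object as the queue purely for brevity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    address: int  # virtual memory address of the send or receive buffer
    size: int     # size of that buffer in bytes


class WorkQueue:
    """Hypothetical model of one send or receive queue and its doorbell."""

    def __init__(self):
        self.descriptors = deque()
        self.pending = 0  # device-side descriptor counter

    def post(self, descriptor):
        """User agent side: place a descriptor at the tail of the queue."""
        self.descriptors.append(descriptor)

    def ring_doorbell(self):
        """User agent side: notify the device of one posted descriptor."""
        self.pending += 1

    def process_all(self):
        """Device side: handle notified descriptors in the order posted."""
        done = []
        while self.pending > 0 and self.descriptors:
            done.append(self.descriptors.popleft())
            self.pending -= 1
        return done


queue = WorkQueue()
queue.post(Descriptor(address=0x1000, size=4))
queue.ring_doorbell()
queue.post(Descriptor(address=0x1004, size=2))  # posted, doorbell not yet rung
first_pass = queue.process_all()
```

In this model a descriptor that has been posted without a doorbell ring is left in the queue, which reflects the separation between posting and notification described above.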

In order to request the sending of data, the sending application 66 may transmit a
request for sending data to the VI user agent 89, which posts descriptors 102 and 103 in
the send queue 98 of the sender virtual interface 96 and rings the doorbell of the send
queue 98 once for each descriptor. The descriptors 102 and 103 may specify the virtual
memory addresses of send buffers 68 containing data to be sent to the receiver. Once the
descriptors are posted in the send queue 98, the I/O device 70 of the sender preferably
locates the data at the virtual memory addresses indicated in the descriptors and sends the
data to the receiver.

In order to receive data from the sender, the receiving application 74 preferably
sends receive data requests to the VI user agent 91, which posts descriptors 106 and 107
in the receive queue 101 specifying one or more receive buffers 78 to store data from the
sender. However, if no descriptors are posted in the receive queue 101 when the data
arrives, the connection between the sender and receiver may be broken. Similarly, because
the receiver may not perform segmentation or reassembly of data, if data in a given data
packet from the sender exceeds the size of the receive buffer in the descriptor 107 at the
head of the receive queue 101, the connection may also be broken. Accordingly, it is
desirable to coordinate posting of descriptors in the send queue 98 of the sender with the
posting of descriptors in the receive queue 101 of the receiver, i.e., it is desirable to
control flow between the sender and the receiver.
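
The two failure conditions described above may be illustrated with the following Python sketch. The exception ConnectionBroken and the function deliver are hypothetical names, and the receive queue is reduced to a list of buffer sizes; the sketch only models when a connection would be broken, not how an adapter reports it.

```python
from collections import deque


class ConnectionBroken(Exception):
    """Hypothetical error standing in for a broken connection."""


def deliver(receive_queue, packet):
    """Place one packet in the buffer at the head of the receive queue."""
    if not receive_queue:
        raise ConnectionBroken("no receive descriptor posted")
    buffer_size = receive_queue[0]
    if len(packet) > buffer_size:
        # No segmentation or reassembly: an oversized packet cannot be stored.
        raise ConnectionBroken("packet larger than head receive buffer")
    receive_queue.popleft()  # one descriptor is consumed per packet
    return buffer_size


receive_queue = deque([4, 2])  # posted buffer sizes, head of the queue first
used = deliver(receive_queue, b"abcd")
```

A sender that sizes its packets from credits, in the order the credits were issued, avoids both branches that raise ConnectionBroken in this sketch.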

In order to control flow between the sender 60 and the receiver 62, the credit list
builder/communicator 83 builds credit messages based on sizes of the receive buffers
contained in receive data requests initiated by the receiving application 74. For example,
when the receiving application posts a descriptor in the receive queue, the credit list
builder/communicator 83 may record the size of the buffer specified by the descriptor in a
credit message. The credit list builder/communicator 83 may repeat this process for each
descriptor posted in the receive queue. When the number of credits in the receive queue
reaches a predetermined value or when the credit list builder/communicator 83
determines that the sender needs credits, the credit list builder/communicator 83
preferably requests that the I/O device 76 send a credit message 82 to the sender.
Methods for determining when the sender needs credits will be discussed in more detail
below.

Credit messages may be sent to the sender in any suitable manner, for example, by
posting a descriptor in the send queue 100 of the receiver containing the size and virtual
memory address of the credit message and ringing the send queue doorbell. However, in
a preferred embodiment of the present invention, the sender and the receiver use a
separate connection from the connection(s) for sending and receiving data for the sending
and receiving of credits. Accordingly, since each virtual interface may connect to one
remote virtual interface to form one network connection, the sender and the receiver may
each include an additional virtual interface for sending and receiving credit messages. In
addition, in an alternative embodiment of the invention, a single sender and a single
receiver may communicate over multiple data connections. In such an embodiment, the
sender and the receiver may each include multiple virtual interfaces for the data
connections and a single virtual interface for a control connection for the exchange of
credit messages. The credit message builder/communicator of the receiver may multiplex
credits or credit messages sent over the control connection to the sender. Each credit or
credit message may specify the data connection or virtual interface to which it pertains.
The credit message reader/processor of the sender may demultiplex the credits and use
the credits to control the sending of data over the corresponding data connection. In order
to prevent credit message overflow, the credit message reader/processor may maintain a
separate credit message buffer for each data connection. In yet another alternative
embodiment, the receiver may comprise a server that communicates with a plurality of
client senders. In such an embodiment, the receiver may include one virtual interface for
sending credits to and receiving credits from each of the remote senders. That is, the
number of credit message connections may be equal to the number of remote client
senders. Any number of credit message connections and data connections is within the
scope of the invention.

Once a credit message is transmitted to the sender, the credit list reader/processor
75 of the sender receives the credit message 82. Receiving a credit message may require
the previous posting of a descriptor containing the virtual address and size of a credit
message buffer in the receive queue 99 of the sender. Alternatively, as discussed above,
credits may be received over a separate connection from the connection for receiving
data. The credit list reader/processor 75 may use the credits in the credit message to
control the posting of descriptors in the send queue 98. For example, the sender may
have a send buffer containing six bytes of data to be sent to the receiver. The credit
message 82 from the receiver may contain a first credit of four bytes and a second credit
of two bytes, indicating that the descriptors 106 and 107 specify two-byte and four-byte
receive buffers, respectively. Accordingly, the credit list reader/processor 75 may post a
first descriptor 103 in the send queue 98 containing a pointer pointing to the first byte of
the send buffer with a size of four bytes and a second descriptor 102 in the send queue 98
containing a pointer pointing to the fifth byte of the send buffer 68 with a size of two
bytes. The I/O device 70 may process the descriptor 103 and transmit a first data packet
having four bytes of data to the receiver. The I/O device 70 may process the second
descriptor 102 and transmit a second data packet of two bytes of data to the receiver.
When the receiver receives the data packets, the receiver processes the descriptor 107,
then the descriptor 106, to store the received data packets in four- and two-byte receive
buffers, respectively. In this manner, the sender only sends data that the receiver is
capable of receiving. As a result, data transmission overflow errors are reduced and
transmission efficiency is increased.
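
The use of credits to size send descriptors, as in the six-byte example above, may be sketched in Python as follows. The function name descriptors_for and the representation of a descriptor as an (offset, size) pair are assumptions made for illustration. The sketch also assumes that a credit is consumed whole even when a packet does not fill the corresponding buffer, since each packet occupies one receive descriptor.

```python
def descriptors_for(send_length, credits):
    """Split a send buffer into (offset, size) descriptors, one per credit.

    Credits are used in the order received. Returns the descriptors to post
    and the credits left over for later sends.
    """
    descriptors = []
    remaining = list(credits)
    offset = 0
    while offset < send_length and remaining:
        credit = remaining.pop(0)
        size = min(credit, send_length - offset)  # never exceed the credit
        descriptors.append((offset, size))
        offset += size
    return descriptors, remaining


# Six bytes to send against a four-byte credit followed by a two-byte credit.
posted, left_over = descriptors_for(6, [4, 2])
```

The first descriptor starts at offset 0 (the first byte) with a size of four bytes, and the second starts at offset 4 (the fifth byte) with a size of two bytes. If the credits run out before the buffer is exhausted, the returned descriptors cover only part of the data and the sender waits for new credits.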

In the embodiment illustrated in Figure 3, the credit list builder/communicator 83 
and the credit list reader/processor 75 are preferably implemented in software, e.g., in the 
VI user agents 91 and 89. In an alternative embodiment, the credit list 
builder/communicator and the credit list reader/processor may be implemented in 
hardware, e.g., in hardware of the I/O devices. Implementing the credit list 
builder/communicator and the credit list reader/processor in the hardware of the I/O
devices allows flow control functions to be performed transparently to the
communications software of the sender and the receiver.

Figure 3(a) illustrates a detailed block diagram of a sender and a receiver in which
the credit list builder/communicator and the credit list reader/processor are implemented
in hardware. In the illustrated embodiment, the credit list reader/processor 75a is a
hardware component of the I/O device 70 of the sender and the credit list
builder/communicator 83a is a hardware component of the I/O device 76 of the receiver.
The remaining components in Figure 3(a) are the same as those illustrated in Figure 3,
and their descriptions are therefore not repeated. 

In order to regulate the flow of data from the sender, the credit list 
builder/communicator 83a of the receiver may generate a list of credits based on 
descriptors posted in the receive queue 101. The list of credits may be stored in memory
of the I/O device 76 or in memory of the host computer in which the I/O device is
inserted. The credit list builder/communicator 83a may send the credit list to the sender
by instructing the I/O device 76 to send the list directly from the memory location in
which the credit list is stored. The credit list builder/communicator 83a may also
determine when to communicate new credits to the sender as will be discussed in more
detail below. 

The credit list reader/processor 75a may receive the credit list and process the
credits in order to send data to the receiver. However, unlike the credit list
reader/processor 75 illustrated in Figure 3, rather than posting descriptors in the send
queue, the credit list reader/processor 75a may control the sending of data specified by
descriptors previously posted in the send queue 98 of the sender so that the size of data
packets actually sent to the receiver corresponds to the credits. For example, a descriptor
specifying the sending of eleven bytes of data may be located at the head of the send
queue 98. The credit list reader/processor 75a may have two credits of five bytes and six
bytes. Accordingly, the credit list reader/processor 75a may break the data buffer
specified by the descriptor into a first data packet of five bytes and a second data packet
of six bytes. Thus, when the credit list reader/processor and the credit list 
builder/communicator are implemented in hardware, flow control can be achieved 
transparently to the VI user agents 88 and 90. 
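
The splitting of a posted send buffer according to the credits held by the sender may be
sketched as follows. This is a minimal Python illustration only and does not represent the
hardware implementation; the function name split_by_credits and its return format are
assumptions introduced here for the example.

```python
from typing import List, Tuple


def split_by_credits(data_len: int, credits: List[int]) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Split data_len bytes into packets no larger than each credit, in credit order.

    Returns (packets, unused_credits), where each packet is an (offset, size) pair.
    Hypothetical helper for illustration; not taken from the specification.
    """
    packets: List[Tuple[int, int]] = []
    remaining = list(credits)
    offset = 0
    while offset < data_len and remaining:
        credit = remaining.pop(0)              # credits are consumed in list order
        size = min(credit, data_len - offset)  # never exceed the receive buffer size
        packets.append((offset, size))
        offset += size
    return packets, remaining


# The eleven-byte descriptor with credits of five and six bytes from the text.
packets, unused = split_by_credits(11, [5, 6])
print(packets)  # [(0, 5), (5, 6)]
```

If the credits run out before the buffer is exhausted, the sketch simply stops, which
corresponds to the sender refraining from sending until more credits arrive.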

As stated above, the credit list builder/communicator 83 preferably determines
when to communicate new credits to the sender. Determining when to provide the sender
with new credits may be accomplished in any number of ways. Figure 4 illustrates
exemplary steps which may be performed by the credit list builder/communicator 83 to
determine when to communicate new credits to the sender. In step ST1, the credit list
builder/communicator 83 may receive requests for receiving data from the receiving 
application 74. The credit list builder/communicator 83 preferably determines the size of 
the receive buffer in each request and adds a credit of a corresponding size to a credit list.
Step ST1 is preferably executed repeatedly and concurrently with the remaining steps in
Figure 4 to accumulate credits as requests are received from the receiving application 74.
In steps ST2 and ST3, the credit list builder/communicator 83 determines whether the
number of accumulated credits exceeds a predetermined number or whether the sender
has no credits. The predetermined number of credits may be based on a maximum credit 
message length, which may be determined by the smaller of the network MTU between
the sender and the receiver and the size of the buffer posted by the sender to receive credit
messages. If either condition is satisfied, the credit list builder/communicator 83 may
communicate a first batch or list of credits to the sender (step ST4). For example, the credit
list builder/communicator may instruct the I/O device 76 to send a first credit message to 
the sender, e.g., by posting a descriptor having a pointer to the credit message in the send 
queue of the control connection of the receiver and ringing the send queue doorbell. In 
an alternative embodiment, for example, where the sender and receiver communicate
using shared memory, the credit list builder/communicator 83 may write the credit list to
the control portion of the shared memory if either condition is satisfied. If neither of the
conditions is satisfied, the credit list builder/communicator may continue to accumulate 
credits. In steps ST5 and ST6, the credit list builder/communicator 83 determines
whether data has been received from the sender for the first buffer specified in the first
batch of credits communicated to the sender. If data has not been received in the first
buffer, the credit list builder/communicator 83 preferably continues checking whether
data has been received in the first buffer, i.e., without communicating a new list of credits
to the sender. If the credit list builder/communicator 83 determines that data has been
received in the first buffer specified in the first credit list, the credit list
builder/communicator 83 determines whether new credits are available (steps ST7 and
ST8). If new credits are available, the credit list builder/communicator 83 preferably
communicates a new credit list to the sender (step ST9). For example, the credit list
builder/communicator may instruct the I/O device 76 to send a new credit list to the
sender containing newly accumulated credits. The newly accumulated credits may be
based on receive buffers contained in data receive requests initiated by the receiving
application after the previous credit message was sent. If there are no newly accumulated
credits, the credit list builder/communicator 83 may continue to check until new credits
are available. After the new credit list is communicated to the sender, the credit list
builder/communicator 83 determines whether data has been received in the first buffer
specified in the new credit list (steps ST10 and ST11). If data has not been received in
the first buffer in the new credit list, the credit list builder/communicator 83 preferably
continues checking, i.e., without communicating another new credit list to the sender. If
the credit list builder/communicator 83 determines that data has been received in the first
buffer in the new credit list, the credit list builder/communicator 83 preferably checks
whether new credits are available and instructs the I/O device 76 to send another new
credit list to the sender.

The approach illustrated in Figure 4, in which the receiver communicates a new
credit list to the sender when the first buffer specified in a previous credit list is used,
increases the number of credits available to the sender at any given time. This approach
may be desirable if the sender is rapidly consuming available credits. In an alternative
approach, steps ST5, ST6, ST10, and ST11 can be modified so that the credit list
builder/communicator 83 determines when the last buffer in a previous credit list is used
before communicating a new credit list to the sender. This approach would reduce the
number of credit list communications sent by the receiver and the number of credits
available to the sender at any given time. Such an approach may be desirable if the
sender is not rapidly consuming available credits. In yet another alternative, the credit list
builder/communicator 83 may instruct the I/O device of the receiver to communicate new
credit lists to the sender when a buffer between the first and last buffers in a previous
credit list receives data from the sender.
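
The decision logic of Figure 4, together with the last-buffer variant, may be modeled
roughly as shown below. This is a simplified, event-driven Python sketch under stated
assumptions: the class name CreditListBuilder, the send callback, and the trigger_last flag
are hypothetical, data is assumed to arrive strictly in credit order, and the count threshold
of step ST2 and the maximum credit message length are omitted for brevity.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


class CreditListBuilder:
    """Toy model of the receiver-side decision of when to send a new credit list."""

    def __init__(self, send: Callable[[List[int]], None], trigger_last: bool = False):
        self.send = send                  # hypothetical callback that transmits a credit list
        self.trigger_last = trigger_last  # False: first buffer triggers; True: last buffer
        self.pending: List[int] = []      # credits accumulated but not yet communicated
        self.outstanding: Deque[Tuple[int, int]] = deque()  # (list id, position) per unused credit
        self.last_list_id = 0
        self.trigger_position = 0
        self.trigger_seen = False

    def on_receive_request(self, size: int) -> None:
        """Step ST1: add a credit for each receive buffer posted by the application."""
        self.pending.append(size)
        self._maybe_send()

    def on_data_received(self) -> None:
        """Data arrived in the next buffer, in credit order (steps ST5/ST6, ST10/ST11)."""
        list_id, position = self.outstanding.popleft()
        if list_id == self.last_list_id and position == self.trigger_position:
            self.trigger_seen = True
        self._maybe_send()

    def _maybe_send(self) -> None:
        if not self.pending:
            return                        # steps ST7/ST8: no new credits yet
        if self.outstanding and not self.trigger_seen:
            return                        # sender still has credits; trigger buffer unused
        self.last_list_id += 1
        count = len(self.pending)
        self.trigger_position = count - 1 if self.trigger_last else 0
        self.outstanding.extend((self.last_list_id, i) for i in range(count))
        self.send(list(self.pending))     # steps ST4/ST9
        self.pending.clear()
        self.trigger_seen = False


sent: List[List[int]] = []
builder = CreditListBuilder(sent.append)
builder.on_receive_request(3)    # sender has no credits, so [3] is sent at once
builder.on_receive_request(7)    # accumulated
builder.on_receive_request(22)   # accumulated
builder.on_data_received()       # first buffer of the previous list has been used
print(sent)  # [[3], [7, 22]]
```

Constructing the builder with trigger_last=True delays each new list until the last buffer
of the previous list has been used, which reduces the number of credits the sender holds
at any given time.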

According to another aspect of the invention, the method for determining when to
communicate new credits to the sender is adaptable. Figure 5 illustrates an adaptable
approach for determining when to communicate new credits to the sender. Steps ST1-ST6
are the same as steps ST1-ST6 in Figure 4 and their description is not repeated. In
step ST7, after data has been received in a first buffer specified in a first credit list
previously communicated to the sender, the credit list builder/communicator 83
determines the frequency at which the buffers are being used by the sender. In step ST8,
the credit list builder/communicator 83 determines which buffer in the credit list currently
being used by the sender will trigger the sending of a new credit message, based on the
frequency. For example, if the sender is rapidly using buffers in the current credit
message, a new credit message may be sent when a buffer near the beginning of the
current credit message is used. On the other hand, if the sender is slowly consuming the
buffers in the current credit message, the credit list builder/communicator 83 may wait
until a buffer near the end of the current credit message is used to send the new credit
message. In steps ST9 and ST10, the credit list builder/communicator 83 determines if
the triggering buffer in the current credit message has received data from the sender. If
the triggering buffer has not received data, the credit list builder/communicator 83
preferably continues checking. If the triggering buffer has received data, the credit list
builder/communicator 83 may determine if any new credits have been accumulated
(steps ST11 and ST12). If new credits have not been accumulated, the credit list
builder/communicator may continue checking. If new credits have been accumulated, the
credit list builder/communicator 83 may instruct the I/O device 76 to send a new credit
list to the sender. In steps ST14 and ST15, the credit list builder/communicator 83
determines whether data has been received in the first buffer in the new credit list. If data
has not been received in the first buffer, the credit list builder/communicator 83
preferably continues checking. If data has been received, the credit list
builder/communicator 83 returns to step ST7 to determine the frequency at which the
sender is utilizing buffers and determine which buffer in the new credit message will
trigger the sending of another new credit list. In an alternative arrangement, step ST7
may be executed continuously so that the triggering buffer can be updated continuously.

In this manner, the number of credit messages and credits made available to the sender is 
controlled based on the rate at which the sender is using credits. In another alternative 
arrangement, the times at which buffers in a first credit message are used may be input to 
an adaptive-predictive filter. The adaptive-predictive filter may predict when the sender 
will most likely need a new list of credits. 
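
One possible way of selecting the triggering buffer from the observed usage frequency is
sketched below in Python. The specification does not prescribe a particular mapping; the
function name choose_trigger_index, the use of the mean time between buffer uses, the
fast_ms and slow_ms thresholds, and the linear interpolation between them are all
assumptions made for illustration.

```python
def choose_trigger_index(list_len: int, mean_gap_ms: float,
                         fast_ms: float = 1.0, slow_ms: float = 10.0) -> int:
    """Pick which buffer in the current credit list triggers the next credit message.

    mean_gap_ms is the average time between buffer uses by the sender. A fast sender
    (small gap) triggers near the beginning of the list; a slow sender triggers near
    the end. The thresholds are hypothetical tuning values.
    """
    if list_len <= 0:
        raise ValueError("credit list must contain at least one credit")
    if mean_gap_ms <= fast_ms:
        return 0                  # rapid consumption: trigger on the first buffer
    if mean_gap_ms >= slow_ms:
        return list_len - 1       # slow consumption: trigger on the last buffer
    fraction = (mean_gap_ms - fast_ms) / (slow_ms - fast_ms)
    return int(fraction * (list_len - 1))


print(choose_trigger_index(8, 0.5))   # 0
print(choose_trigger_index(8, 20.0))  # 7
```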

The present invention is not limited to utilizing a triggering buffer to determine
when to communicate a new credit list to the sender. For example, in an alternative
embodiment, steps ST9 and ST10 in Figure 5 may be replaced by steps for determining a
time, e.g., in milliseconds, for communicating a new credit list to the sender. In such an
embodiment, the credit list builder/communicator 83 may monitor the frequency at which
the sender consumes credits and, based on the frequency, determine to communicate new
credits after a predetermined time period elapses. Any method for adaptively
determining when to communicate new credits to the sender is within the scope of the
invention. 

In another alternative embodiment, the credit list builder/communicator 83 may
implement quality of service functions by regulating the rate at which credits are
communicated to the sender. For example, the receiver may comprise a server that
provides services to a plurality of client senders. The receiver may prevent any one of the
clients from exceeding a predetermined maximum allowable bandwidth by not sending
credits to the client when doing so would allow the client to exceed the maximum
allowable bandwidth. By preventing clients from exceeding a maximum allowable
bandwidth using credits, the server can guarantee a certain quality of service to all clients.

For example, since none of the clients can exceed the maximum allowable bandwidth,
the processing load on the server is determined by the number of clients and the
maximum allowable bandwidth. If the maximum allowable bandwidth is set somewhat
lower than the bandwidth that the server is capable of servicing for a given connection,
the server can service a greater number of clients with less hardware. In contrast, without
a maximum bandwidth limitation, in order to guarantee service to a given number of
clients, the server must contain sufficient resources to handle bursts of communications
from the clients in excess of the average bandwidth. Thus, the credit-based methods and
systems for regulating flow between a sender and the receiver can be used to facilitate
server resource planning.
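
The withholding of credits to enforce a maximum allowable bandwidth may be sketched as
follows. This Python example is illustrative only: the class name BandwidthLimitedGranter,
the accounting of bandwidth as bytes granted per fixed time window, and the start_window
method are assumptions, since the specification does not state how bandwidth is measured.

```python
from typing import List


class BandwidthLimitedGranter:
    """Releases credits to one client only while a per-window byte budget allows it."""

    def __init__(self, max_bytes_per_window: int):
        self.max_bytes_per_window = max_bytes_per_window  # hypothetical bandwidth cap
        self.granted_bytes = 0

    def start_window(self) -> None:
        """Called at the start of each accounting window (assumed to be timer-driven)."""
        self.granted_bytes = 0

    def grant(self, pending: List[int]) -> List[int]:
        """Move credits from pending to the returned list while staying under the cap."""
        granted: List[int] = []
        while pending and self.granted_bytes + pending[0] <= self.max_bytes_per_window:
            credit = pending.pop(0)
            self.granted_bytes += credit
            granted.append(credit)
        return granted


granter = BandwidthLimitedGranter(max_bytes_per_window=100)
pending = [40, 50, 30]
print(granter.grant(pending))  # [40, 50]
print(pending)                 # [30]
```

The withheld credit of thirty bytes remains pending and may be granted after
start_window() is called for the next window. In this sketch a single credit larger than
the cap would never be granted, so a practical design would need to bound buffer sizes
or handle that case separately.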

According to another aspect of the invention, the sender may transmit in-band
information to the receiver along with the data packets. The in-band information may
include the cumulative amount of data remaining to be sent by the sender. The credit list
builder/communicator 83 may utilize this information to switch between one or more of
the approaches previously described for determining when to communicate new credits to
the sender. Figure 6 illustrates an exemplary approach for switching between modes for
determining when to communicate new credits to the sender. In step ST1, the credit list
builder/communicator 83 operates in a first mode for determining when to communicate
new credits to the sender. The first mode may comprise the steps illustrated in Figure 4
for sending new credit lists when a first buffer in a previous credit list triggers the sending
of a new credit list. In step ST2, the credit list builder/communicator 83 analyzes the
in-band data transmitted from the sender. The in-band data may include any information for
assisting the credit list builder/communicator 83 in determining when to communicate
new credits to the sender. For example, the in-band data may include the amount of data 
remaining to be sent by the sender. 

In step ST3, the credit list builder/communicator 83 may determine whether 
switching would increase performance, e.g., by analyzing the amount of data remaining
to be sent, the number of credits remaining, and/or frequency of buffer usage. For
example, the credit list builder/communicator 83 may compare the monitored information
to a threshold value or a set of threshold values to determine whether to switch modes. If
the analysis indicates that the rate at which credits are currently being communicated to
the sender is too fast, the credit list builder/communicator 83 may switch to a second
mode for determining when to communicate new credits to the sender to slow the rate at
which credits are communicated to the sender (step ST4). On the other hand, if the
analysis indicates that the communication of credits is too slow, the credit list
builder/communicator 83 may switch to a third mode for determining when to
communicate new credits to the sender to increase the rate of communication of credits
to the sender. If switching modes would not increase performance, the credit list
builder/communicator 83 may continue operating in the current mode (step ST5) and
return to checking the in-band information. The in-band information may also be used as
an input to an adaptive-predictive filter to adaptively determine when to communicate
new credits to the sender. By using in-band information from the sender to determine
when to send new credit messages, the I/O performance of the sender and receiver may be
further improved. For instance, the in-band information may be utilized to reduce the
number of mode switches when the data rate is highly variable.

Figures 7(a) and 7(b) illustrate steps that may be performed by the credit list
reader/processor 75 according to an embodiment of the present invention. Figure 7(a)
illustrates exemplary steps that may be performed by the credit list reader/processor 75
for receiving credits from the receiver and posting buffers to receive new credits. Figure
7(b) illustrates exemplary steps that may be performed by the credit list reader/processor
75 for processing credits to send data to the receiver. The steps illustrated in Figure 7(a)
may be executed concurrently with the steps illustrated in Figure 7(b). In Figure 7(a), the
credit list reader/processor 75 posts a first buffer for receiving credits and notifies the I/O
device 70 of the posting (step ST1). For example, the credit list reader/processor may
write the virtual address of a descriptor pointing to the buffer for receiving credit
messages to the receive queue of the virtual interface of the sender for sending and
receiving credit messages and ring the associated doorbell. Alternatively, where credits
are communicated to the sender using shared memory or RDMA write operations, the
sender may ensure that buffer space in memory exists for receiving credits. The buffer is
preferably posted before a connection is established between the receiver and the sender.
In steps ST2 and ST3, the credit list reader/processor checks for credits transmitted from
the receiver. Checking whether credits have been received may include reading the
memory location or locations reserved for receiving credits. If credits have not been
received, the credit list reader/processor may continue checking or waiting to be notified
of the reception of credits. If credits have been received, the credit list reader/processor
75 may post a second buffer for receiving new credits from the receiver (step ST4). Once the
second buffer has been posted, the credit list reader/processor may store the credits from
the first buffer in a credit list and return to step ST2 to check for credits in the buffer
posted in step ST4 (step ST5).

In Figure 7(b), the credit list reader/processor 75 receives a request for sending
data originating from the sending application (step ST1a). The request may comprise a
"Winsock" send() function including a buffer virtual address and a buffer size. In steps
ST2a and ST3a, the credit list reader/processor 75 determines if the credit list contains
any credits. If the credit list does not contain any credits, the credit list reader/processor
75 preferably continues checking, i.e., without requesting sending of the data. If the
credit list contains credits, the credit list reader/processor 75 may request the
sending of data corresponding to the size indicated by the first credit (step ST4a). The
credit list reader/processor preferably then removes the first credit from the credit list
(step ST5a). The credit list reader/processor 75 may then update a data pointer pointing to
the data to be sent and check whether any data remains to be sent (steps ST6a and ST7a).
If data remains to be sent, the credit list reader/processor 75 may return to step ST2a to
use the next credit in the credit list if additional credits are available. If no data remains
to be sent, the credit list reader/processor may return to step ST1a to receive the next
request for sending data from the sending application. 
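
The steps of Figure 7(b) may be sketched in Python as shown below. The sketch is
event-driven rather than polling, handles one send request at a time, and uses
hypothetical names (CreditedSender, transmit, add_credits); it is an illustration under
those assumptions and not the implementation of the credit list reader/processor 75.

```python
from collections import deque
from typing import Callable, Deque, Iterable


class CreditedSender:
    """Sends a buffer in pieces, one piece per credit, pausing when credits run out."""

    def __init__(self, transmit: Callable[[bytes], None]):
        self.transmit = transmit           # hypothetical callback that posts one data packet
        self.credits: Deque[int] = deque()
        self.data = b""
        self.offset = 0                    # data pointer into the send buffer

    def add_credits(self, credits: Iterable[int]) -> None:
        """Store newly received credits (Figure 7(a), step ST5) and resume sending."""
        self.credits.extend(credits)
        self._pump()

    def send(self, data: bytes) -> None:
        """Step ST1a: accept a send request (assumes the previous request has finished)."""
        self.data, self.offset = data, 0
        self._pump()

    def _pump(self) -> None:
        # Steps ST2a/ST3a and ST7a: continue only while credits and data both remain.
        while self.offset < len(self.data) and self.credits:
            size = self.credits[0]
            chunk = self.data[self.offset:self.offset + size]
            self.transmit(chunk)           # step ST4a: send at most `size` bytes
            self.credits.popleft()         # step ST5a: remove the used credit
            self.offset += len(chunk)      # step ST6a: advance the data pointer


sent = []
sender = CreditedSender(sent.append)
sender.send(b"0123456789")   # no credits yet, so nothing is transmitted
sender.add_credits([4])      # transmits the first four bytes
sender.add_credits([4, 4])   # transmits four bytes, then the last two
print(sent)  # [b'0123', b'4567', b'89']
```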

By only sending data when credits are available, the credit list reader/processor 
reduces the likelihood of data transmission overflow conditions. In addition, because the
credit list reader/processor 75 posts a new buffer for receiving credits before using
newly-received credits, the credit list reader/processor 75 allows the credit list
builder/communicator 83 of the receiver to assume that the sender has a buffer available
for receiving credits if the receiver receives data corresponding to the first credit in a new
credit list. Thus, embodiments of the present invention may also reduce the likelihood of
data transmission overflow conditions in transmitting credit messages from the receiver
to the sender.

The credit list reader/processor according to the present invention is not limited to
the embodiments illustrated in Figures 7(a) and 7(b). For example, as stated above, the
credit list reader/processor 75 may control the transmission of in-band data from the
sender to the receiver. The in-band data may be transmitted along with data packets from
the sender to the receiver. The in-band data may include any data to assist the receiver in
determining when to transmit credit messages to the sender. For example, the in-band
information may include the amount of data remaining to be sent by the sender. 

Figure 8 is a flow diagram illustrating an example of the transfer of data and 
credit messages between the sender 60 and the receiver 62. Each row in the flow 
diagram indicates status information and action taken by the sender 60 and the receiver
62. The first column C1 in the diagram represents the receiver 62, including the credit
list builder/communicator 83, the second column C2 represents the communication link
64, and the third column C3 represents the sender 60 including the credit list
reader/processor 75. In row R1, column C3, the sender 60 has a send buffer 68(a) of
seventy-two bytes to send to the receiver 62. The send buffer 68(a) may have been
communicated to the credit list reader/processor 75 by the sending application of the
sender. PB is a pointer to the first byte of the send buffer 68(a). The sender initially has
zero credits. In row R1, column C1, the receiver has two available receive buffers 78(a)
and 78(b) of size three bytes and seven bytes, respectively. In row R1, column C2, the
receiver 62 sends a credit message 82(a) to the sender 60 indicating the size of the buffers
78(a) and 78(b). The receiver maintains two pointers, Rtrip and Next. Rtrip points to the
first buffer indicated in a credit message sent to the sender. The pointer Next points to the
first receive buffer not communicated to the sender in a credit message. Since there are
no buffer sizes that have not been transmitted to the sender in row R1, the pointer Next
equals zero. 

The pointer Rtrip may be used to indicate to the sender when new credit message 
buffers are available at the sender for receiving credit messages. For example, as described
above, if data is received in a buffer in a list of credits previously communicated to the
sender, the receiver can assume that the sender has posted a new buffer for receiving
credit messages. Accordingly, when data is received in the first buffer of the credit
message sent to the sender, the pointer Rtrip may be set to NULL (see row R9, column
C1).

In row R2, column C3, the credit list reader/processor 75 of the sender receives 
the credit message 82(a) and adds credits of three and seven to the credit list. The credit 
list reader/processor 75 posts a first descriptor to the send queue of the sender to send the 
first three bytes of the send buffer 68(a) and updates the buffer pointer PB to point to the
next byte to be sent.

In row R3, column C2, the sender sends a data packet 84(a) containing the first 
three bytes of the send buffer 68(a) to the receiver. In row R3, column C3, the credit list
reader/processor 75 removes the used credit of three from the credit list, posts a descriptor
in the send queue to send the next seven bytes of the send buffer 68(a) and updates the
buffer pointer PB. The shaded bytes in the send buffer 68(a) indicate data that has been
sent to the receiver. Thus, in row R3, column C3, the first three bytes of the send buffer
68(a) have been sent. In row R3, column C1, the credit list builder/communicator 83 has
received a request for receiving data in a new buffer 78(c) of size twenty-two. Since this
buffer 78(c) has not been communicated to the sender, the credit list
builder/communicator preferably updates the pointer Next to point to this buffer. 

In row R4, column C2, the sender sends a data packet 84(b) containing the next
seven bytes of the send buffer 68(a). In row R4, column C3, the sender credit list
reader/processor 75 preferably removes the used credit of seven from the credit list. In
this state, the sender has no credits and preferably does not send any more data until
receiving more credits from the receiver. In row R4, column C1, the receiver receives the
data packet 84(a) into the receive buffer 78(a). The credit list builder/communicator 83
has received a request for receiving data into a new buffer 78(d) of size forty-four.

In row R5, column C3, the sender is idle because it has no credits. In row R5, 
column C2, the receiver sends a credit message 82(b) containing credits of twenty-two 
and forty-four to the sender. In row R5, column C1, the pointer Rtrip is updated to point
to the receive buffer 78(c) corresponding to the first credit in the new credit message
82(b). The credit list builder/communicator 83 sets the pointer Next to zero because there
are no credits that have not been transmitted to the sender. The buffer 78(a) has been
removed from the buffers available for receiving data because the descriptor for that
buffer has been processed. The receive buffer 78(b) receives the data packet 84(b).

In row R6, column C3, the sender receives the credits twenty-two and forty-four. 
The credit list reader/processor 75 posts a descriptor in the send queue to send the next 
twenty-two bytes of the send buffer 68(a). The credit list reader/processor 75 updates the 
buffer pointer PB. In row R6, column C1, the receiver removes the buffer 78(b) from the
list of buffers available for receiving data since it previously received data. 

In row R7, column C2, the sender sends a data packet 84(c) containing the next 
twenty-two bytes of the send buffer 68(a). In row R7, column C3, the credit list
reader/processor 75 removes the used credit of twenty-two from the credit list. The credit
list reader/processor 75 posts a descriptor in the send queue to send the next forty bytes of
data to the receiver and updates the buffer pointer PB. In row R7, column C1, the credit
list builder/communicator 83 receives a request for receiving data into a new buffer 78(e)
of size eleven. The credit list builder/communicator 83 updates the pointer Next to point 
to the new buffer 78(e). 

In row R8, column C1, the receiver receives the data packet 84(c) in the receive
buffer 78(c). In row R8, column C2, the sender sends a data packet 84(d) containing the
last forty bytes of the send buffer 68(a) to the receiver. In row R8, column C3, the
credit list reader/processor removes the used credit of forty-four from the credit list. The
use of a forty-four-byte credit to send a buffer of forty bytes illustrates an acceptable, but
inefficient, use of a credit. Since all of the data has been sent, the buffer pointer PB is set
to NULL.

In row R9, column C1, the receiver receives the data packet 84(d) in the receive
buffer 78(d). The receive buffer 78(d) is removed from the set of receive buffers
available to receive data since its descriptor has been processed. The pointer Rtrip is set
to NULL to indicate the availability of a credit message buffer at the sender. Because the
receiver sends a list of credits to the sender, the sender sends data to the receiver based on
the size and order of the credits, and the receiver receives data into buffers according to 
the size and order of the credits, reliable flow control between the sender and the receiver 
can be achieved with reduced copying of data. 
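
The data flow of Figure 8 can be checked arithmetically with the short Python sketch
below, which replays the two credit messages against the seventy-two-byte send buffer.
The function name trace_transfer is hypothetical, and the sketch models only packet
sizes, not the Rtrip and Next pointers or the timing of the rows.

```python
from typing import List, Tuple


def trace_transfer(send_len: int, credit_messages: List[List[int]]) -> Tuple[List[int], int]:
    """Return (packet sizes in order, credit bytes granted but left unused)."""
    packets: List[int] = []
    offset = 0       # bytes of the send buffer already sent
    granted = 0      # total size of all credits communicated
    for message in credit_messages:
        for credit in message:
            granted += credit
            if offset < send_len:
                size = min(credit, send_len - offset)  # packet never exceeds its credit
                packets.append(size)
                offset += size
    return packets, granted - offset


# Credit messages 82(a) and 82(b) applied to the seventy-two-byte send buffer 68(a).
packets, unused = trace_transfer(72, [[3, 7], [22, 44]])
print(packets, unused)  # [3, 7, 22, 40] 4
```

The four unused bytes correspond to the forty-four-byte credit that carries only the last
forty bytes of the send buffer.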

In view of the many possible embodiments to which the principles of this 
invention may be applied, it should be recognized that the embodiments described herein
with respect to the drawing figures are meant to be illustrative only and should not be
taken as limiting the scope of invention. For example, those of skill in the art will
recognize that the elements of the illustrated embodiments shown in software may be
implemented in hardware and vice versa or that the illustrated embodiments can be
modified in arrangement and detail without departing from the spirit of the invention.
Therefore, the invention as described herein contemplates all such embodiments as may
come within the scope of the following claims and equivalents thereof.

CLAIMS
I claim: 

1. A method for controlling data flow between a sender and a receiver
comprising:

communicating a first credit list to a sender, the first credit list comprising a
plurality of credits indicative of buffer sizes of receive buffers accessible by a receiver
and capable of receiving data from the sender; and

in response to receiving the first credit list, transmitting a data packet, from
the sender to the receiver, the data packet being no greater in size than a first buffer size
specified by a first credit in the first credit list.

2. The method of claim 1 comprising receiving the data packet into a receive
buffer corresponding to the first credit and having the first buffer size.

3. The method of claim 1 wherein transmitting the data packet includes
transferring data from an application-level send buffer to an input/output device without
copying the data between the application-level receive buffer and the input/output device.

4. The method of claim 2 comprising after receiving the data packet,
communicating a second credit list to the sender.

5. The method of claim 3 wherein transferring data from the application-level
receive buffer to the input/output device includes posting a descriptor in a send queue
associated with the input/output device and ringing a doorbell associated with the send
queue.

6. The method of claim 1 wherein communicating the first credit list to the
sender includes transmitting a first credit message including the first credit list from the
sender to the receiver.

7. The method of claim 1 wherein communicating the first credit list to the
sender includes writing the first credit list into a memory buffer shared by the sender and
the receiver.

8. The method of claim 1 wherein the sender executes on a first computer
and the receiver executes on a second computer and communicating the first credit list to
the sender includes performing a remote direct memory access write operation from the
second computer to memory of the first computer to write the first credit list to the
memory of the first computer.

9. The method of claim 1 comprising establishing at least one first
connection for transmitting data packets between the sender and the receiver and
establishing a second connection between the sender and the receiver for transmitting
credit messages between the sender and the receiver.

10. The method of claim 9 wherein establishing at least one first connection
includes establishing a plurality of first connections for transmitting data packets
between the sender and the receiver.

11. The method of claim 10 comprising multiplexing credit messages on the
second connection, the credit messages including credits indicating receive buffer sizes
for each of the plurality of first connections.



12. A credit list builder/communicator comprising computer-executable
instructions embodied in a computer-readable medium for performing steps comprising:
receiving requests for receiving data into a plurality of receive buffers accessible
by a receiver and capable of receiving data from a sender;
in response to the requests, building a credit list including a plurality of credits
indicative of sizes of the plurality of receive buffers; and
communicating the credit list to the sender.
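
As a rough illustration of the steps recited in claim 12, the following Python sketch builds a credit list from posted receive requests and hands it to a sender. The names `ReceiveRequest`, `CreditListBuilder`, `post_receive`, and the `send` callback are hypothetical and do not come from the specification; each credit is assumed to be simply the byte size of one receive buffer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReceiveRequest:
    """A request to receive data into one buffer (hypothetical type)."""
    buffer_size: int  # size of the receive buffer in bytes


@dataclass
class CreditListBuilder:
    """Collects one credit per posted receive buffer, in posting order."""
    send: Callable[[List[int]], None]  # assumed transport for credit lists
    pending: List[int] = field(default_factory=list)

    def post_receive(self, request: ReceiveRequest) -> None:
        # Each credit records the size of the corresponding receive buffer.
        self.pending.append(request.buffer_size)

    def communicate(self) -> None:
        # Hand the accumulated credits to the sender, then start a new list.
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
```

In this sketch the transport is injected as a callback so that the same builder could sit in front of a credit message, a shared memory buffer, or a remote write, which are the alternatives the dependent claims enumerate.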

13. The credit list builder/communicator of claim 12 comprising
computer-executable instructions for determining when to communicate credits to the sender.



14. The credit list builder/communicator of claim 13 wherein the
computer-executable instructions for determining when to communicate credits to the sender
include instructions for monitoring a frequency of credit usage by the sender.

15. The credit list builder/communicator of claim 13 wherein the
computer-executable instructions for determining when to communicate credits to the sender
include instructions for setting a maximum allowable bandwidth for receiving data from
the sender and refraining from communicating additional credits to the sender when the
additional credits would allow the sender to exceed the maximum allowable bandwidth.
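
One way to picture the bandwidth limit of claim 15 is a per-window byte budget: credits are withheld once the bytes they would admit exceed the budget. The sketch below is an assumption made for illustration only; the function name `credits_within_budget` and the idea of a fixed accounting window are not taken from the claims.

```python
from typing import List, Tuple


def credits_within_budget(
    candidate_credits: List[int],
    outstanding_bytes: int,
    max_bytes_per_window: int,
) -> Tuple[List[int], List[int]]:
    """Split candidate credits into (communicate now, hold back).

    outstanding_bytes is the total size of credits already granted in the
    current window; max_bytes_per_window stands in for the maximum allowable
    bandwidth (both are hypothetical parameters).
    """
    granted: List[int] = []
    budget = max_bytes_per_window - outstanding_bytes
    for i, credit in enumerate(candidate_credits):
        if credit > budget:
            # Stop at the first credit that would exceed the cap so that
            # credits keep the order of the receive buffers they describe.
            return granted, candidate_credits[i:]
        granted.append(credit)
        budget -= credit
    return granted, []
```

Stopping at the first credit that does not fit, rather than skipping it, keeps the granted credits in buffer order.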

16. The credit list builder/communicator of claim 12 wherein the
computer-executable instructions for communicating the credit list to the sender include
instructions for transmitting a credit message including the credit list from the receiver
to the sender.

17. The credit list builder/communicator of claim 12 wherein the
computer-executable instructions for communicating the credit list to the sender include
instructions for writing the credit list to a memory buffer shared by the sender and the
receiver.

18. The credit list builder/communicator of claim 12 wherein the sender
executes on a first computer and the receiver executes on a second computer and the
computer-executable instructions for communicating the credit list to the sender include
instructions for performing a remote direct memory access from the second computer to
memory of the first computer to write the credit list to the memory of the first computer.

19. The credit list builder/communicator of claim 12 wherein the
computer-executable instructions for communicating the credits to the sender include
instructions for inserting the credit list in an options field in a TCP packet.

20. A data structure for controlling data flow between a sender and a receiver
comprising:
a credit list including a plurality of credits, each of the credits being indicative
of a buffer size of a receive buffer accessible by a receiver and capable of receiving data
from a sender.

21. The data structure of claim 20 wherein the plurality of credits are arranged
in an order corresponding to an order of posting of descriptors in a receive queue of the
receiver.

22. The data structure of claim 20 wherein the credit list is included in a credit
message transmitted from the receiver to the sender through a network.

23. The data structure of claim 20 wherein the credit list is stored in a memory
buffer shared between the sender and the receiver.

24. The data structure of claim 20 wherein the credit list is included in a
remote direct memory access write packet transmitted from the receiver to the sender
through a network.

25. The data structure of claim 20 wherein the credit list is included in an
options field of a TCP packet transmitted from the receiver to the sender.

26. A credit list reader/processor comprising computer-executable instructions
embodied in a computer-readable medium for performing steps comprising:
posting a first buffer accessible by a sender for receiving credits from a
receiver;
determining whether credits have been received in the first buffer;
in response to receiving credits in the first buffer, posting a second buffer
accessible by the sender for receiving additional credits from the receiver; and
after posting the second buffer, storing credits from the first buffer in a credit
list.



27. The credit list reader/processor of claim 26 wherein the credits received in
the first buffer are arranged in a first order and the computer-executable instructions for
storing the credits in the credit list comprise instructions for storing the credits in the first
order.

28. The credit list reader/processor of claim 26 comprising
computer-executable instructions for, after storing the credits in the credit list, requesting
transmission of data packets to the receiver, the data packets having sizes controlled by
the credits in the credit list.



29. The credit list reader/processor of claim 28 comprising
computer-executable instructions for removing a credit from the credit list after requesting
transmission of each data packet to the receiver.

30. The credit list reader/processor of claim 29 comprising
computer-executable instructions for, when the credits in the credit list are exhausted, delaying
requesting of transmission of data packets to the receiver until new credits are received
from the receiver.
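
The sender-side behavior of claims 28 to 30 can be sketched as follows: each packet is cut to the size of the next credit, the credit is removed once the packet is requested, and sending stops when no credits remain. The function name `packets_for` and the use of a `deque` of byte counts are illustrative assumptions, not details from the specification.

```python
from collections import deque
from typing import Deque, List


def packets_for(data: bytes, credit_list: Deque[int]) -> List[bytes]:
    """Cut data into packets no larger than successive credits.

    One credit is removed per packet. When the credit list runs out, the
    function stops; unsent data stays with the caller until new credits
    arrive, so the returned packets may cover only a prefix of data.
    """
    packets: List[bytes] = []
    offset = 0
    while offset < len(data) and credit_list:
        credit = credit_list.popleft()  # remove the used credit
        packets.append(data[offset:offset + credit])
        offset += credit
    return packets
```

A caller would compare the total length of the returned packets with `len(data)` to learn how much remains to be sent after replenishment.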



31. A credit list builder/communicator comprising computer-executable
instructions embodied in the computer-readable medium for performing steps for
determining when to communicate new credits to a sender comprising:
communicating a first credit list to a sender;
determining if data has been received in a first buffer corresponding to a
credit in the first credit list; and,
in response to determining that data has been received in the first buffer,
communicating a second credit list to the sender.

32. The credit list builder/communicator of claim 31 wherein the first buffer
corresponds to a first credit in the first credit list.

33. The credit list builder/communicator of claim 31 wherein the first buffer
corresponds to a final credit in the first credit list.

34. The credit list builder/communicator of claim 31 wherein the first buffer
corresponds to a credit between a first credit and a final credit in the first credit list.
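
Claims 31 to 34 differ only in which credit's buffer acts as the trigger for sending the second credit list. A minimal sketch of that choice, assuming the receiver tracks a per-buffer "data received" flag (the `filled` list and `trigger_index` parameter are hypothetical names), might look like this:

```python
from typing import List


def should_send_next_list(filled: List[bool], trigger_index: int) -> bool:
    """Return True once the buffer behind the trigger credit holds data.

    filled[i] tells whether the buffer for credit i of the first credit list
    has received data. trigger_index 0 selects the first credit (claim 32),
    len(filled) - 1 the final credit (claim 33), and anything in between an
    intermediate credit (claim 34).
    """
    if not 0 <= trigger_index < len(filled):
        raise IndexError("trigger_index must refer to a credit in the list")
    return filled[trigger_index]
```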

35. A credit list builder/communicator comprising computer-executable
instructions embodied in a computer-readable medium for performing steps for
determining when to communicate new credits to a sender comprising:
communicating a first credit list to a sender;
monitoring a frequency at which the sender consumes credits in the first credit
list; and
determining when to communicate a second credit list to the sender based on the
frequency.

36. The credit list builder/communicator of claim 35 wherein the
computer-executable instructions for determining when to communicate the second credit list
to the sender include instructions for computing a time in time units for communicating the
second credit list to the sender.

37. The credit list builder/communicator of claim 35 wherein the
computer-executable instructions for determining when to communicate the second credit list
to the sender include instructions for determining a triggering buffer in the first credit
list for triggering communication of the second credit list to the sender.
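
To make the idea of a frequency-derived triggering buffer in claim 37 concrete, the sketch below picks the trigger so that enough unused credits remain to cover one round trip. This policy, the function name `triggering_index`, and the round-trip parameter are all assumptions introduced for illustration; the claims only require that the trigger be determined from the monitored frequency.

```python
import math


def triggering_index(num_credits: int, credits_per_second: float,
                     round_trip_seconds: float) -> int:
    """Pick the credit whose buffer, once filled, triggers the next list.

    Assumed policy: leave enough unused credits to keep the sender busy for
    one round trip, so the new list arrives before the old one runs out.
    """
    if num_credits <= 0:
        raise ValueError("credit list must not be empty")
    in_flight = math.ceil(credits_per_second * round_trip_seconds)
    # Index of the last credit that may be consumed before replenishing.
    return max(0, num_credits - 1 - in_flight)
```

A faster sender or a longer round trip moves the trigger toward the front of the list, so replenishment starts earlier.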

38. A credit list builder/communicator comprising computer-executable
instructions embodied in a computer-readable medium for performing steps comprising:
operating in a first mode for determining when to communicate new credits to
a sender;
receiving in-band information from the sender;
analyzing the in-band information; and
if the in-band information indicates that switching would increase
input/output performance, switching to a second mode for determining when to
communicate new credits to the sender.

39. The credit list builder/communicator of claim 38 wherein the in-band
information includes an amount of data remaining to be sent from the sender to the
receiver.
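
The mode switch of claims 38 and 39 can be sketched with the in-band hint reduced to a single number, the bytes the sender still has to send. The `Mode` enum, the `next_mode` function, and the 64 KiB threshold are hypothetical; the threshold test is only a placeholder for whatever analysis decides that switching would increase input/output performance.

```python
from enum import Enum


class Mode(Enum):
    FIRST = "first"    # e.g. replenish when a given buffer receives data
    SECOND = "second"  # e.g. replenish on a usage-frequency schedule


def next_mode(current: Mode, bytes_remaining: int,
              bulk_threshold: int = 64 * 1024) -> Mode:
    """Decide whether to switch credit-timing modes.

    bytes_remaining is the in-band hint from the sender (claim 39). The
    threshold test is a placeholder heuristic; the claims do not prescribe
    how the performance benefit is estimated.
    """
    if current is Mode.FIRST and bytes_remaining >= bulk_threshold:
        return Mode.SECOND
    return current
```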

40. The credit list builder/communicator of claim 38 comprising
computer-executable instructions for refraining from switching from the first mode to the
second mode if a variance in the rate for receiving data packets from the sender exceeds a
first value, based on the in-band information.

41. An input/output device comprising:
a processing circuit;
a memory device coupled to the processing circuit, the memory device
storing a credit list builder/communicator including computer-executable instructions for
performing steps comprising:
receiving requests for receiving data into receive buffers stored at virtual
memory locations of a host computer connectable to the input/output device;
building a credit list including a plurality of credits indicative of sizes of
the receiver buffers; and
communicating the credit list to the sender.



42. An input/output device comprising:
a processing circuit;
a memory device coupled to the processing circuit, the memory device
storing a credit list reader/processor including computer-executable instructions for
performing steps comprising:
posting a first buffer accessible by a sender for receiving credits from a
receiver;
determining whether credits have been received in the first buffer;
in response to receiving credits in the first buffer, posting a second buffer
accessible by the sender for receiving additional credits from the receiver; and
after posting the second buffer, storing credits from the first buffer in a
credit list.

43. The input/output device of claim 42 wherein the credit list
reader/processor comprises computer-executable instructions for reading the credits in the
credit list and requesting sending of data packets to the receiver having sizes based on the
credits.

44. A network communications system comprising:
a first local virtual interface for sending data to and receiving data from a
first remote virtual interface over a first network connection;
a second local virtual interface for sending credit messages to and
receiving credit messages from a second remote virtual interface over a second network
connection; and
a credit list builder/communicator for building credit messages for
controlling data flow over the first network connection and communicating the credit
messages to the second remote virtual interface through the second local virtual interface
and the second network connection, the credit messages including credit lists including a
plurality of credits indicative of buffer sizes of receive buffers for receiving data through
the first local virtual interface from the first remote virtual interface.

45. The network communications system of claim 44 comprising a credit
message reader/processor for reading credit messages received from the second remote
virtual interface through the second network connection and the second local virtual
interface and requesting sending of data packets to the first remote virtual interface
through the first local virtual interface, the data packets having sizes based on
credits in the credit messages received from the second remote virtual interface.

46. The network communications system of claim 44 comprising a plurality of
first local virtual interfaces for sending data to and receiving from a plurality of first
remote virtual interfaces through a plurality of first network connections.

47. The network communications system of claim 45 comprising a plurality of
first local virtual interfaces for sending data to and receiving from a plurality of first
remote virtual interfaces through a plurality of first network connections.

48. The network communications system of claim 46 wherein the credit list
builder/communicator builds credit messages for controlling data flow over the plurality
of first network connections, the credit messages for controlling data flow over the
plurality of first network connections including credits indicative of buffer sizes for
receiving data through the plurality of first local virtual interfaces.

49. The network communications system of claim 48 wherein the credit list
builder/communicator multiplexes and sends the credit messages for controlling data
flow over the plurality of first network connections to the second remote virtual interface
through the second local virtual interface and the second network connection, wherein
each of the plurality of credit messages for controlling data flow over the plurality of first
network connections indicates one of the plurality of first remote virtual interfaces to
which the credits in the credit message pertain.

50. The network communications system of claim 47 wherein the credit
message reader/processor receives multiplexed credit messages for controlling data flow
over the plurality of first network connections, demultiplexes the credit messages and
sends data packets to the plurality of first remote virtual interfaces, the data packets
having sizes based on credits in the credit messages for controlling data flow over the
plurality of first network connections.
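
The demultiplexing recited in claim 50 can be illustrated by tagging each credit message with the data connection it pertains to and routing the credits into per-connection lists. The message layout `(connection id, credits)` and the function name `demultiplex` are assumptions for this sketch; the claims do not define a wire format.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Tuple

# Assumed layout of one multiplexed credit message: (connection id, credits).
CreditMessage = Tuple[int, List[int]]


def demultiplex(messages: Iterable[CreditMessage]) -> Dict[int, Deque[int]]:
    """Route credits from one shared credit connection to per-connection lists."""
    per_connection: Dict[int, Deque[int]] = defaultdict(deque)
    for connection_id, credits in messages:
        # Credits keep their arrival order within each data connection.
        per_connection[connection_id].extend(credits)
    return dict(per_connection)
```

The sender could then draw packet sizes for each data connection from the matching list.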

51. The network communications system of claim 46 comprising a plurality of
second virtual interfaces, each of the second virtual interfaces for receiving credit
messages for controlling data flow over one of the plurality of first network connections.



WO 00/41365                                                    PCT/US99/30860

[Drawing sheet 1/9. SUBSTITUTE SHEET (RULE 26).]



[Drawing sheet 2/9, FIG. 2: block diagram. Receiver 62 comprises a receiving application, receive buffers, an I/O device interface, a credit list builder/communicator, and an I/O device. Sender 60 comprises a sending application, send buffers, an I/O device interface, a credit list reader/processor, and an I/O device. A communication link between the receiver and the sender carries credit messages and data. Reference numerals 66, 68, 70, 72, 74, 75, 76, 78, 80, 83, and 84 appear on the sheet.]






[Drawing sheet 5/9, FIG. 4: flow chart for the credit message builder/communicator, with step labels ST1 to ST11 and Yes/No decision branches. Recoverable step text, in order of appearance: Receive request for receiving data from application and add credits to credit list. Determine whether sender has no credits or number of accumulated credits exceeds predetermined number. Instruct I/O device to communicate first credit list to sender. Determine if data has been received in first buffer in first credit list. Determine if new credits are available. Communicate new credit list to sender. Determine if data has been received in first buffer of new credit list.]



[Drawing sheet 6/9, FIG. 5: flow chart for the credit list builder/communicator, with step labels ST1 to ST15 and Yes/No decision branches. Recoverable step text, in order of appearance: Receive request for receiving data from application and add credits to credit list. Determine whether sender has no credits or number of accumulated credits exceeds predetermined number. Instruct I/O device to communicate first credit list to sender. Determine if data has been received in first buffer in first credit list. Monitor and determine frequency of buffer usage. Determine triggering buffer based on frequency. Determine if triggering buffer in current credit list receives data. Determine if new credits have been accumulated. Communicate new credit list to sender. Determine if first buffer in new credit list has received data.]



[Drawing sheet 7/9, FIG. 6: flow chart for the credit message builder/communicator, with step labels ST1 to ST5 and a Yes/No decision branch. Recoverable step text: Operate in first mode for determining when to communicate new credits to sender. Analyze in-band information received from sender. Continue utilizing current mode for determining when to communicate new credits to sender. Switch modes for determining when to communicate new credits to sender.]



[Drawing sheet 8/9, FIGS. 7(a) and 7(b): flow charts for the credit message reader/processor 75.
FIG. 7(a), steps ST1 to ST5: Post first buffer for receiving credits from receiver. Check for credits received in first buffer. Post second buffer for receiving additional credits from receiver. Store credits from first buffer in credit list.
FIG. 7(b), steps ST1a to ST7a: Receive request for sending data from sending application. Determine if credit list contains credits. Request sending of data corresponding to size indicated by first credit. Remove used credit from credit list. Update data pointer and check for data remaining to be sent.]



[Drawing sheet 9/9, FIG. 8: diagram with columns C1 (receiver 62), C2 (communication link 64), and C3 (sender 60) and rows R1 to R9. Each row shows the Next, Rtrip, and PB pointers, elements 78(a) to 78(e), and the contents of the sender's "Credits:" list; elements 82(a), 82(b), and 84(a) to 84(d) appear on the communication link. Numeric values appearing in the rows include 3, 7, 11, 22, 40, and 44. In the final rows PB and Rtrip are shown as NULL.]



INTERNATIONAL SEARCH REPORT

International Application No: PCT/US 99/30860

A. CLASSIFICATION OF SUBJECT MATTER
IPC 7 H04L12/56

According to International Patent Classification (IPC) or to both national classification and IPC

B. FIELDS SEARCHED
Minimum documentation searched (classification system followed by classification symbols)
IPC 7 H04L

Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched

Electronic data base consulted during the international search (name of data base and, where practical, search terms used)

C. DOCUMENTS CONSIDERED TO BE RELEVANT

Category*   Citation of document, with indication, where appropriate,   Relevant to claim No.
            of the relevant passages

A           EP 0 674 414 A (AVID TECHNOLOGY INC)                        1, 12, 20, 26, 31,
            27 September 1995 (1995-09-27)                              35, 38, 41, 42, 44
            column 11, line 28 - column 13, line 7

A           US 5 432 784 A (OEZVEREN CUENEYT M)                         1, 12, 20, 26, 31,
            11 July 1995 (1995-07-11)                                   35, 38, 41, 42, 44
            column 4, line 43 - column 5, line 16

□ Further documents are listed in the continuation of box C.
Patent family members are listed in annex.

* Special categories of cited documents:
"A" document defining the general state of the art which is not considered to be of particular relevance
"E" earlier document but published on or after the international filing date
"L" document which may throw doubts on priority claim(s) or which is cited to establish the publication date of another citation or other special reason (as specified)
"O" document referring to an oral disclosure, use, exhibition or other means
"P" document published prior to the international filing date but later than the priority date claimed
"T" later document published after the international filing date or priority date and not in conflict with the application but cited to understand the principle or theory underlying the invention
"X" document of particular relevance; the claimed invention cannot be considered novel or cannot be considered to involve an inventive step when the document is taken alone
"Y" document of particular relevance; the claimed invention cannot be considered to involve an inventive step when the document is combined with one or more other such documents, such combination being obvious to a person skilled in the art
"&" document member of the same patent family

Date of the actual completion of the international search: 5 June 2000
Date of mailing of the international search report: 16/06/2000

Name and mailing address of the ISA:
European Patent Office, P.B. 5818 Patentlaan 2
NL-2280 HV Rijswijk
Tel. (+31-70) 340-2040, Tx. 31 651 epo nl
Fax: (+31-70) 340-3016

Authorized officer: RAMIREZ DE AREL... F

Form PCT/ISA/210 (second sheet) (July 1992)



INTERNATIONAL SEARCH REPORT
Information on patent family members

International Application No: PCT/US 99/30860

Patent document            Publication    Patent family      Publication
cited in search report     date           member(s)          date

EP 0674414    A            27-09-1995     US 5987501  A      16-11-1999
                                          US 5799150  A      25-08-1998

US 5432784    A            11-07-1995     NONE

Form PCT/ISA/210 (patent family annex) (July 1992)